International application No. PCT/IE00/00083



I. Basis of the report 

1. With regard to the elements of the international application (Replacement sheets which have been furnished to
the receiving Office in response to an invitation under Article 14 are referred to in this report as "originally filed"
and are not annexed to this report since they do not contain amendments (Rules 70.16 and 70.17)):
Description, pages: 

1-63 as originally filed

Claims, No.: 

1-18 as received on 11/07/2001 with letter of 05/07/2001

Drawings, sheets: 

1-7 as originally filed 



2. With regard to the language, all the elements marked above were available or furnished to this Authority in the 
language in which the international application was filed, unless otherwise indicated under this item. 

These elements were available or furnished to this Authority in the following language: , which is: 

□ the language of a translation furnished for the purposes of the international search (under Rule 23.1 (b)). 

□ the language of publication of the international application (under Rule 48.3(b)). 

□ the language of a translation furnished for the purposes of international preliminary examination (under Rule 
55.2 and/or 55.3). 

3. With regard to any nucleotide and/or amino acid sequence disclosed in the international application, the 
international preliminary examination was carried out on the basis of the sequence listing: 

□ contained in the international application in written form. 

□ filed together with the international application in computer readable form. 

□ furnished subsequently to this Authority in written form. 

□ furnished subsequently to this Authority in computer readable form. 

□ The statement that the subsequently furnished written sequence listing does not go beyond the disclosure in 
the international application as filed has been furnished. 

□ The statement that the information recorded in computer readable form is identical to the written sequence 
listing has been furnished. 

4. The amendments have resulted in the cancellation of: 

□ the description, pages: 

□ the claims, Nos.: 
□ the drawings, sheets:

5. □ This report has been established as if (some of) the amendments had not been made, since they have been
considered to go beyond the disclosure as filed (Rule 70.2(c)):

(Any replacement sheet containing such amendments must be referred to under item 1 and annexed to this
report.)

6. Additional observations, if necessary: 

III. Non-establishment of opinion with regard to novelty, inventive step and industrial applicability 

1. The questions whether the claimed invention appears to be novel, to involve an inventive step (to be
non-obvious), or to be industrially applicable have not been examined in respect of:

□ the entire international application. 
☒ claims Nos. 16-18.

because: 

□ the said international application, or the said claims Nos. relate to the following subject matter which does
not require an international preliminary examination (specify):

☒ the description, claims or drawings (indicate particular elements below) or said claims Nos. 16-18 are so
unclear that no meaningful opinion could be formed (specify):
see separate sheet

□ the claims, or said claims Nos. are so inadequately supported by the description that no meaningful opinion 
could be formed. 

□ no international search report has been established for the said claims Nos. . 

2. A meaningful international preliminary examination cannot be carried out due to the failure of the nucleotide 
and/or amino acid sequence listing to comply with the standard provided for in Annex C of the Administrative 
Instructions: 

□ the written form has not been furnished or does not comply with the standard. 

□ the computer readable form has not been furnished or does not comply with the standard. 

V. Reasoned statement under Article 35(2) with regard to novelty, inventive step or industrial applicability; 
citations and explanations supporting such statement 

1. Statement 

Novelty (N)                      Yes: Claims 2, 4-8, 10
                                 No:  Claims 3, 9, 11-15

Inventive step (IS)              Yes: Claims 2
                                 No:  Claims 4-8, 10

Industrial applicability (IA)    Yes: Claims 1-15
                                 No:  Claims



2. Citations and explanations 
see separate sheet 

VII. Certain defects in the international application 

The following defects in the form or contents of the international application have been noted: 
see separate sheet 
EXAMINATION REPORT - SEPARATE SHEET

The examination is being carried out on the following application documents: 
Text for the Contracting States: 

AT BE CH DE DK ES FI FR GB GR IT IE LI LU MC NL PT SE
Description, pages: 

1-63 as originally filed 

Claims, No.: 

1-18 as received on 11/07/2001 with letter of 05/07/2001

Drawings, sheets: 

1-7 as originally filed 



Re Item III 

Non-establishment of opinion with regard to novelty, inventive step and 
industrial applicability 

Claims 16-18 do not meet the requirements of Article 6 PCT in that the matter for which
protection is sought is not clearly defined. Because of the following lack of clarity, a
reasoned statement with regard to novelty and inventive step for these claims is not
possible:

Contrary to the requirements of Article 6 PCT, independent claim 16 does not
clearly define all the features necessary for the definition of the invention, i.e. for allowing
logic event simulation. Nor do its dependent claims.
Re Item V

Reasoned Statement under Article 35(2) with regard to novelty, inventive step or
industrial applicability; citations and explanations supporting such statement

Reference is made to the following documents: 

D1: DALTON D: 'AN ASSOCIATIVE MEMORY APPROACH TO PARALLEL
LOGIC EVENT-DRIVEN SIMULATION', PROCEEDINGS OF THE ANNUAL
EUROPEAN CONFERENCE ON COMPUTER SYSTEMS AND SOFTWARE
ENGINEERING (COMPEURO), LOS ALAMITOS, US, IEEE COMP. SOC.
PRESS, vol. CONF. 6, 4 May 1992 (1992-05-04), pages 341-346,
XP000344219, ISBN: 0-8186-2760-3

D2: DALTON D: 'A special purpose hybrid SIMD processor for logic event
simulation', PROCEEDINGS OF THE SEVENTH EUROMICRO WORKSHOP
ON PARALLEL AND DISTRIBUTED PROCESSING (PDP'99), FUNCHAL,
PORTUGAL, 3-5 FEB. 1999, pages 74-83, XP002158195, LOS ALAMITOS,
CA, USA, IEEE COMPUT. SOC., 1999, ISBN: 0-7695-0059-5

D3: Verteilte Systeme, Michael Weber, Spektrum Akademischer Verlag, Berlin
Heidelberg, 1998, p. 8-9

The document D3 was not cited in the international search report. A copy of the 
document is appended hereto. 

1. The subject-matter of claim 1 is not new in the sense of Article 33(2) PCT.

Documents D1 and D2, which both describe earlier versions of the same APPLES
system as the present application, each disclose clearly and unambiguously all the
features of claim 1.

2. Dependent claims 3-15 do not appear to contain any additional features which
meet the requirements of the PCT with respect to novelty and inventive step.
Regarding claim 3: see D2, section 5.3, l. 1-7

Regarding claims 5, 10: see D1, p. 342, r. col., l. 27 - p. 343, r. col., l. 41

This passage shows a method in which each line signal to a target logic gate is stored
as a plurality of bits, each bit representing a delay of one time period, the
aggregate bits representing the inherent delay of each logic gate. If, as stated in
the description, p. 19, l. 32 - p. 20, l. 1, the time taken to transmit a message
between two logic gates can become considerable [in comparison] as the speed of
circuits increases, the skilled person can be assumed to realize this fact and to know
that the aggregate bits may then be interpreted as representing the inherent delay
of each logic gate together with the corresponding input line, i.e. as also
representing the delay between signal output and reception by the target logic
gate.
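As an illustration of this interpretation, the following is a minimal sketch, assuming that bit i of a line's history word holds the value driven onto the line i time periods ago. The class `DelayLine` and its parameters are hypothetical and are not taken from D1 or from the application. Once the aggregate bits are sized for the gate delay plus the line delay, the value seen by the target gate is simply the bit at that combined depth.

```python
class DelayLine:
    """Hypothetical model of one line driving a target gate.

    Bit i of `history` holds the value driven onto the line i time periods ago.
    The depth read by the target is the gate's inherent delay plus the line
    delay, i.e. the aggregate bits cover both (the interpretation above).
    """

    def __init__(self, gate_delay: int, wire_delay: int = 0) -> None:
        self.delay = gate_delay + wire_delay
        self.mask = (1 << (self.delay + 1)) - 1  # keep bits 0..delay
        self.history = 0

    def step(self, value: int) -> int:
        """Shift in this period's driven value; return the value seen by the target."""
        self.history = ((self.history << 1) | (value & 1)) & self.mask
        return (self.history >> self.delay) & 1
```

For a gate delay of 2 and a line delay of 1, the driven sequence 1, 0, 1, 1, 0, 0, 0 reaches the target gate as 0, 0, 0, 1, 0, 1, 1, i.e. shifted by three periods.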

Regarding claim 8: see D2, section 5.3, l. 1-7 in combination with D2, p. 76, r. col., l. 9-12
and D2, p. 80, r. col., l. 1-5

Regarding claim 9: see D2, p. 75, r. col., l. 6-14; p. 76, l. col., l. 4 - r. col., l. 13

Regarding claim 11: see D2, section 3.2, l. 1-9

Regarding claims 12-14: see D2, p. 77, l. col., l. 36 - p. 78, l. col., l. 13; D1, p. 342, item (v)

Regarding claim 15: see D1, p. 342, r. col., l. 27 - p. 343, r. col., l. 41; D1, fig. 6; D2, fig. 2

It is clear that in D1, fig. 6, and D2, fig. 2, the associative array 1b is intended to store,
for each logic gate, a record of all values that the logic gate has acquired, including
for the logic gate with the longest delay in the circuit.

Regarding claims 4, 6, 7: 
The subject-matter of these claims also appears not to be inventive, for the following
reasons:

(i) The inclusion of scan registers in the system of figure 6 of D1, as shown in figure
10, item 6a, of the application, is disclosed in D2, section 5.3, l. 1-7.

(ii) From D2, sections 5.2-5.4, the skilled person knows that

"the hit density of the [hit-] list has been found to be unaffected by the circuit size
[i.e. to be about 1%, see table 3] while the scanning and updating processes is
amenable to parallelisation. When the access time of the fan-out list is
commensurate to that of the scan clock [which evidently is possible, see section
5.3, l. 18-25] linear speedup relative to the number of registers is possible.
Consequently, by employing more scan registers [i.e. as many as average hits per
evaluation, see table 3 and table 4, first row] it is possible to push the scan rate
towards that of the fan-out memory access rate providing clashes are minimised.
A low clash rate is achievable through a fast memory access time and the low
activity rate of logic circuits [the former being possible and the latter given, see above]".
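The linear-speedup argument in the quoted passage can be illustrated with a simple cost model. This is a sketch under stated assumptions, not D2's own analysis: each scan register advances one hit-list position per scan-clock cycle, stops once its remaining positions hold no hits, and output-register clashes are ignored. The function `scan_cycles` is hypothetical.

```python
from typing import List


def scan_cycles(hit_list: List[int], segments: int) -> int:
    """Cycles needed to scan a hit list split into `segments` equal parts.

    Hypothetical cost model: each scan register advances one position per
    cycle, stops once its remaining positions hold no hits, and all registers
    run in parallel with output-register clashes ignored.
    """
    if segments < 1:
        raise ValueError("need at least one scan register")
    n = len(hit_list)
    if n == 0:
        return 0
    size = -(-n // segments)  # ceiling division: positions per segment
    worst = 0
    for start in range(0, n, size):
        segment = hit_list[start:start + size]
        hits = [i for i, bit in enumerate(segment) if bit]
        cycles = hits[-1] + 1 if hits else 0  # scan up to and including the last hit
        worst = max(worst, cycles)
    return worst
```

With ten hits spread evenly over 1000 positions (the roughly 1% hit density mentioned above), one scan register needs 1000 cycles, four registers need 250 and ten registers need 100. When hits cluster in one segment the gain shrinks, which is one way to read the proviso about clashes.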

The skilled person seeking further acceleration of logic event simulation will
apply this knowledge to D2, figure 4, and realize

(a) that all steps in each gate-type-specific execution of a for-loop are
comparable in the number of cycles needed (evidently, for the last step, the
fan-out gate updates, a for-loop needs only that fraction of the fan-out gate
updates of the total evaluation which corresponds to the gate type
considered in that for-loop), and

(b) that all steps within a for-loop can be performed in parallel with any step
within another for-loop.

Thus, with this knowledge, transforming the sequential execution of the for-loops into
concurrent execution, a well-known principle of parallelisation for speed-up, appears
feasible to the skilled person, and also appears to be a good, or even the only, way to
obtain further significant speed-up. (In addition, the use of several arrays is itself
proposed as an outlook in D2, section 5.4, l. 23-24, which in itself leads the skilled
person to search for ways of parallelisation.)

Therefore, it seems obvious to parallelise the for-loops and to execute each
for-loop on a different executing unit, which in the present case is a set of associative
arrays or register banks as shown in D1, fig. 6. Because the results, here produced
by the respective multiple response resolver and fan-out memory and read in via
the respective input value bank, must be exchanged between the executing units,
a means corresponding to item 42 in fig. 10 of the application is evidently
necessary.
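The structure of this argument can be sketched as follows: one for-loop per gate type, each on its own executing unit, with a shared means for exchanging results. This is a minimal illustration assuming 2-input gates of the types AND, OR and XOR only. The gate record layout and all function names are hypothetical. Python threads are used only to show the partitioning; they give no real speed-up for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical gate record: (gate_id, gate_type, input_a, input_b)
Gate = Tuple[int, str, int, int]

OPS: Dict[str, Callable[[int, int], int]] = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
}


def evaluate_type(gates: List[Gate], values: Dict[int, int]) -> Dict[int, int]:
    """One gate-type-specific for-loop: evaluate every gate of a single type."""
    return {gid: OPS[gtype](values[a], values[b]) for gid, gtype, a, b in gates}


def evaluate_concurrently(gates: List[Gate], values: Dict[int, int]) -> Dict[int, int]:
    """Run one for-loop per gate type on its own worker and merge the results."""
    partitions: Dict[str, List[Gate]] = {}
    for gate in gates:
        partitions.setdefault(gate[1], []).append(gate)  # one partition per gate type
    merged: Dict[int, int] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        futures = [pool.submit(evaluate_type, part, values) for part in partitions.values()]
        for future in futures:
            merged.update(future.result())  # stands in for the exchange means (item 42)
    return merged
```

The merge step is where the hypothetical exchange means sits: each worker only produces results for its own gate type, and the combined map is what the next evaluation round would read.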

In addition, the skilled person knows grid-based topologies and hypercube structures
for enabling efficient communication between executing units in parallel
implementations (see D3, section 1.2.4).

3. Dependent claim 2 (see the description, p. 13, l. 14-28, and p. 20, l. 4-24)
appears to contain additional features which meet the requirements of the PCT with
respect to novelty and inventive step, and which solve the problem of holding a number
of delay units whose delay word width exceeds the width of the associative
register 1b.

Re Item VII 

Certain defects in the international application 

Contrary to the requirements of Rule 5.1(a)(ii) PCT, the relevant background art
disclosed in documents D1 and D2 is not mentioned in the description, nor are
these documents identified therein.
CLAIMS



1. A parallel processing method of logic simulation comprising representing
signals on a line over a time period as a bit sequence, evaluating the output of
any logic gate including an evaluation of any inherent delay by a comparison
between the bit sequences of its inputs to a predetermined series of bit
patterns and in which those logic gates whose outputs have changed over the
time period are identified during the evaluation of the gate outputs as real
gate changes and only those real gate changes are propagated to fan out
gates and in which the control of the method is carried out in an associative
memory mechanism which stores in word form a history of gate input signals
by compiling a hit list register of logic gate state changes and using a multiple
response resolver forming part of the associative memory mechanism which
generates an address for each hit, and then scans and transfers the results
on the hit list to an output register for subsequent use characterised in that
the hit list is segmented into a plurality of separate smaller hit lists each
connected to a separate scan register and in which each scan register is
operated in parallel to transfer the results to the output register.
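The associative operations recited in claim 1 can be illustrated with a minimal sketch. A masked comparison of every stored word against a pattern, as a content-addressable memory performs in one parallel step, produces the hit list; a multiple response resolver then turns the hit list into addresses. The word layout (input A in bits 4-7, input B in bits 0-3, bit 0 being the most recent time period) and the example template are assumptions for illustration only, not the templates of the application.

```python
from typing import List


def associative_search(words: List[int], pattern: int, mask: int) -> List[int]:
    """Masked compare of every stored word against `pattern`.

    A content-addressable memory does this for all words in one parallel step;
    the list comprehension only models the result. Returns the hit list,
    one bit per stored word.
    """
    return [1 if (w & mask) == (pattern & mask) else 0 for w in words]


def resolve(hit_list: List[int]) -> List[int]:
    """Multiple response resolver: one address per hit, in address order."""
    return [address for address, bit in enumerate(hit_list) if bit]


# Hypothetical template: both inputs of a 2-unit-delay AND gate have been 1
# for the last two periods (bits 0-1 of each 4-bit input field set).
template = 0b0011_0011
words = [0b0111_0011, 0b0001_0011, 0b1111_1111, 0b0011_0010]
hits = associative_search(words, template, template)
print(hits, resolve(hits))  # [1, 0, 1, 0] [0, 2]
```

Only the addresses produced by `resolve` would then be handed to the scan registers, which is what keeps the work proportional to the number of real gate changes.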

2. A method as claimed in claim 1 in which the associative register is divided into
separate smaller associative sub-registers, one type of logic gate being
allocated to each sub-register, each of which associative sub-registers has
corresponding sub-registers connected thereto whereby gate evaluations and
tests are carried out in parallel on each associative sub-register.

3. A method as claimed in claim 1 or 2 in which each associative sub-register is 
used to form a hit list connected to a corresponding separate scan register. 

4. A method as claimed in any of claims 1 to 3 in which where the number of the
one type of logic gate exceeds a predetermined number more than one
sub-register is used.



5. A method as claimed in any preceding claim in which the scan registers are
controlled by exception logic using an OR gate whereby the scan is
terminated for each register on the OR gate changing state thus indicating no
further matches.

6. A method as claimed in claim 5 in which the scan is carried out by sequential 
counting through the hit list and the steps are performed of: 

checking if the bit is set indicating a hit; 

if a hit, determining the address effected by that hit; 

storing the address; 

clearing the bit in the hit list; 

moving to the next position in the hit list; and 

repeating the above steps until the hit list is cleared. 
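A minimal sketch of the scan steps of claim 6 for a single scan register follows, with the OR-gate termination of claim 5 modelled as a test that any bit of the hit list remains set. Taking the stored address to be the hit-list position plus a per-segment base address is an assumption; according to claim 1 the address for each hit is generated by the multiple response resolver. The function name is hypothetical.

```python
from typing import List


def scan_hit_list(hit_list: List[int], base_address: int = 0) -> List[int]:
    """Sketch of the claim 6 scan for one scan register.

    Counts sequentially through the hit list; for each set bit it derives an
    address (here, hypothetically, base_address + position), stores it and
    clears the bit. The loop ends when the OR of the remaining bits is 0,
    i.e. no further matches remain (the exception logic of claim 5).
    """
    bits = list(hit_list)          # work on a copy, like a hardware register
    stored: List[int] = []
    position = 0
    while any(bits):               # OR gate over the remaining hit list
        if bits[position]:         # check if the bit is set, indicating a hit
            stored.append(base_address + position)  # determine and store the address
            bits[position] = 0     # clear the bit in the hit list
        position += 1              # move to the next position
    return stored
```

Because the loop stops as soon as the last hit is cleared, trailing empty positions cost nothing, which is the point of the exception logic.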

7. A method as claimed in any preceding claim, in which each line signal to a 
target logic gate is stored as a plurality of bits each representing a delay of 
one time period, the aggregate bits representing the delay between signal 
output to and reception by the target logic gate. 

8. A method as claimed in any preceding claim, in which each delay is stored as 
a delay word in an associative memory forming part of the associative 
memory mechanism in which:- 

the length of the delay word is ascertained; and 

if the delay word width exceeds the associative register word width:- 

the number of integer multiples of the register word width contained within the
delay word is calculated as a gate state;

the gate state is stored in a further state register;

the remainder from the calculation is stored in the associative register with
those delay words whose widths did not exceed the associative
register word width; and

on the count of the associative register commencing:-

the state register is consulted for the delay word entered in the state register
and the remainder is ignored for this count of the associative register;

at the end of the count of the associative register, the state register is
updated; and

the count continues until the remainder represents the count still required.
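One reading of claim 8 is sketched below, under the assumption that delays are counted in time units with one bit per unit of the associative register word; the function names are hypothetical. A delay wider than the register word is split into a number of whole register counts (the gate state, held in the state register) and a remainder (held in the associative register). The count then runs the whole counts first and the remainder last.

```python
from typing import Tuple


def split_delay(delay: int, word_width: int) -> Tuple[int, int]:
    """Split a delay into (gate_state, remainder) for a register of word_width bits.

    A delay that fits the register word is stored directly (gate state 0).
    Otherwise the number of whole register widths becomes the gate state, kept
    in a separate state register, and the remainder stays in the associative
    register with the delays that fit.
    """
    if delay <= word_width:
        return 0, delay
    return divmod(delay, word_width)


def count_delay(delay: int, word_width: int) -> int:
    """Replay the counting scheme and return the number of time units counted."""
    state, remainder = split_delay(delay, word_width)
    elapsed = 0
    while state > 0:              # state register consulted: remainder ignored
        elapsed += word_width     # one full count of the associative register
        state -= 1                # state register updated at the end of the count
    elapsed += remainder          # the count then continues for the remainder
    return elapsed
```

For example, a delay of 20 units with an 8-bit register word gives a gate state of 2 and a remainder of 4; replaying the count recovers the full 20 units.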

9. A method as claimed in any preceding claim in which there is an initialisation
phase in which:

specified signal values are inputted;

unspecified signal values are set to unknown;

test templates are prepared defining the delay model for each logic
gate;

the input circuit is parsed to generate an equivalent circuit consisting
of 2-input logic gates; and

the 2-input logic gates are then configured.
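Two of the initialisation steps of claim 9 can be sketched as follows: setting unspecified signal values to unknown, and parsing an n-input gate into an equivalent chain of 2-input gates. The sketch assumes associative operators only (AND, OR, XOR); inverting gates such as NAND would need the inversion applied to the final 2-input gate only, which is not shown. The `X` marker for unknown and all names are hypothetical.

```python
from typing import Dict, List, Tuple, Union

UNKNOWN = "X"  # hypothetical marker for an unknown signal value
Signal = Union[int, str]
TwoInputGate = Tuple[str, str, str, str]  # (output, operator, input_a, input_b)


def initialise_signals(nets: List[str], specified: Dict[str, int]) -> Dict[str, Signal]:
    """Take specified values as given; set every other net to unknown."""
    return {net: specified.get(net, UNKNOWN) for net in nets}


def to_two_input(name: str, op: str, inputs: List[str]) -> List[TwoInputGate]:
    """Rewrite an n-input AND/OR/XOR gate as an equivalent chain of 2-input gates.

    Intermediate nets are named name_1, name_2, ...; the last gate drives `name`.
    """
    if op not in {"AND", "OR", "XOR"}:
        raise ValueError("only associative operators decompose into a plain chain")
    if len(inputs) < 2:
        raise ValueError("a gate needs at least two inputs")
    gates: List[TwoInputGate] = []
    acc = inputs[0]
    for i, nxt in enumerate(inputs[1:], start=1):
        out = name if i == len(inputs) - 1 else f"{name}_{i}"
        gates.append((out, op, acc, nxt))
        acc = out
    return gates
```

A chain is the simplest equivalent circuit; a balanced tree would give the same logic function with fewer levels, but its levels would add up to a different timing behaviour if each 2-input gate carries its own delay.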



10. A method as claimed in any preceding claim in which a multi-valued logic is
applied and in which n bits are used to represent a signal value at any
instance in time with n being any arbitrarily chosen logic.

11. A method as claimed in claim 10 in which an 8-valued logic is used where 000
represents logic 0, 111 represents logic 1 and 001 to 110 represent arbitrarily
defined other signal states.

12. A method as claimed in claim 10 or 11 in which the sequence of values on a
logic gate is stored as a bit pattern forming a unique word in the associative
memory mechanism.

13. A method as claimed in any preceding claim in which there is stored a record
of all values that a logic gate has acquired for the units of delay of the longest
delay in the circuit.
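The encoding of claims 11 to 13 can be sketched as follows, assuming n = 3 bits per value and a history depth equal to the longest delay in the circuit; the value i time periods ago occupies bits 3i to 3i+2 of the gate's word. The codes 001 to 110 are left unassigned here, as in claim 11, and all names are hypothetical.

```python
from typing import List

BITS = 3                       # n = 3 bits per value: 8-valued logic (claim 11)
VALUE_MASK = (1 << BITS) - 1
LOGIC_0 = 0b000
LOGIC_1 = 0b111
# Codes 0b001 to 0b110 are the "arbitrarily defined other signal states";
# no meaning is assigned to them here.


def pack_history(values: List[int], depth: int) -> int:
    """Pack a gate's recent values into one word (claim 12).

    values[i] is the value i time periods ago; only the first `depth` values
    are kept, `depth` being the longest delay in the circuit (claim 13).
    """
    word = 0
    for i, value in enumerate(values[:depth]):
        if not 0 <= value <= VALUE_MASK:
            raise ValueError(f"value {value} does not fit in {BITS} bits")
        word |= value << (i * BITS)  # value i periods ago occupies bits 3i..3i+2
    return word


def value_at(word: int, periods_ago: int) -> int:
    """Read back the value stored for `periods_ago` time periods ago."""
    return (word >> (periods_ago * BITS)) & VALUE_MASK
```

With a depth of 3, the history logic 1 (now), logic 0 (one period ago) and code 010 (two periods ago) packs into the word 0b010_000_111.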



14. A parallel processing method of logic simulation comprising representing
signals on a line over a time period as a bit sequence, evaluating the output of
any logic gate including an evaluation of any inherent delay by a comparison
between the bit sequences of its inputs to a predetermined series of bit
patterns and in which those logic gates whose outputs have changed over the
time period are identified during the evaluation of the gate outputs as real
gate changes and only those real gate changes are propagated to fan out
gates and in which the control of the method is carried out in an associative
memory mechanism which stores in word form a history of gate input signals
by compiling a hit list register of logic gate state changes and using a multiple
response resolver forming part of the associative memory mechanism which
generates an address for each hit, and then scans and transfers the results
on the hit list to an output register for subsequent use characterised in that the
associative register is divided into separate smaller associative sub-registers,
one type of logic gate being allocated to each associative sub-register, each
of which associative sub-registers has corresponding sub-registers connected
thereto whereby gate evaluations and tests are carried out in parallel on each
associative sub-register.



15. A method as claimed in claim 1 in which the hit list is segmented into a
plurality of separate smaller hit lists corresponding to each associative
sub-register, each smaller hit list is connected to a separate scan register and in
which each scan register is operated in parallel to transfer the results to the
output register.

16. A method as claimed in claim 14 or 15 in which where the number of the one
type of logic exceeds a predetermined number more than one sub-register is
used.

17. A method as claimed in claim 16 in which the scan registers are controlled by
exception logic using an OR gate whereby the scan is terminated for each
register on the OR gate changing state thus indicating no further matches.

18. A method as claimed in claim 17 in which the scan is carried out by sequential
counting through the hit list and the steps are performed of:

checking if the bit is set indicating a hit;

if a hit, determining the address effected by that hit;

storing the address;

clearing the bit in the hit list;

moving to the next position in the hit list; and

repeating the above steps until the hit list is cleared.

19. A method as claimed in any of claims 14 to 18, in which each line signal to a
target logic gate is stored as a plurality of bits each representing a delay of
one time period, the aggregate bits representing the delay between signal
output to and reception by the target logic gate.

20. A method as claimed in any of claims 14 to 19, in which each delay is stored
as a delay word in an associative memory forming part of the associative
memory mechanism in which:-

the length of the delay word is ascertained; and

if the delay word width exceeds the associative register word width:-

the number of integer multiples of the register word width contained
within the delay word is calculated as a gate state;

the gate state is stored in a further state register;

the remainder from the calculation is stored in the associative register
with those delay words whose widths did not exceed the associative
register word width; and

on the count of the associative register commencing:-

the state register is consulted for the delay word entered in the state
register and the remainder is ignored for this count of the associative
register;

at the end of the count of the associative register, the state register is
updated; and

the count continues until the remainder represents the count still
required.

21. A method as claimed in any of claims 14 to 20 in which there is an
initialisation phase in which:

specified signal values are inputted;

unspecified signal values are set to unknown;

test templates are prepared defining the delay model for each logic
gate;

the input circuit is parsed to generate an equivalent circuit consisting
of 2-input logic gates; and

the 2-input logic gates are then configured.

22. A method as claimed in any of claims 14 to 21 in which a multi-valued logic is
applied and in which n bits are used to represent a signal value at any
instance in time with n being any arbitrarily chosen logic.

23. A method as claimed in claim 22 in which an 8-valued logic is used where 000
represents logic 0, 111 represents logic 1 and 001 to 110 represent arbitrarily
defined other signal states.

24. A method as claimed in claim 22 or 23 in which the sequence of values on a
logic gate is stored as a bit pattern forming a unique word in the associative
memory mechanism.

25. A method as claimed in any of claims 14 to 24 in which there is stored a
record of all values that a logic gate has acquired for the units of delay of the
longest delay in the circuit.

26. A parallel processing method of logic simulation comprising representing
signals on a line over a time period as a bit sequence, evaluating the output of
any logic gate by a comparison between the bit sequences of its inputs to a
predetermined series of bit patterns and in which those logic gates whose
outputs have changed over the time period are identified during the evaluation
of the gate outputs as real gate changes and only those real gate changes
are propagated to fan out gates and in which the control of the method is
carried out in an associative memory mechanism which stores in word form a
history of gate input signals by compiling a hit list register of logic gate state
changes and using a multiple response resolver forming part of the
associative memory mechanism which generates an address for each hit, and
then scans and transfers the results on the hit list to an output register for
subsequent use characterised in that each line signal to a target logic gate is
stored as a plurality of bits each representing a delay of one time period, the
aggregate bits representing the delay between signal output to and reception
by the target logic gate and in which the inherent delay of each logic gate is
represented in the same manner.

27. A method as claimed in claim 26, in which each delay is stored as a delay
word in an associative memory forming part of the associative memory
mechanism in which:-

the length of the delay word is ascertained; and

if the delay word width exceeds the associative register word width:-

the number of integer multiples of the register word width contained
within the delay word is calculated as a gate state;

the gate state is stored in a further state register;

the remainder from the calculation is stored in the associative register
with those delay words whose widths did not exceed the associative
register word width; and

on the count of the associative register commencing:-

the state register is consulted for the delay word entered in the state
register and the remainder is ignored for this count of the associative
register;

at the end of the count of the associative register, the state register is
updated; and

the count continues until the remainder represents the count still
required.

28. A method as claimed in claim 26 or 27, in which the hit list is segmented into 
a plurality of separate smaller hit lists each connected to a separate scan 
register and in which each scan register is operated in parallel to transfer the 
results to the output register. 

29. A method as claimed in any of claims 26 to 28, in which the scan registers are 
controlled by exception logic using an OR gate whereby the scan is 
terminated for each register on the OR gate changing state thus indicating no 
further matches. 

30. A method as claimed in claim 29 in which the scan is carried out by sequential 
counting through the hit list and the steps are performed of: 

checking if the bit is set indicating a hit; 

if a hit, determining the address effected by that hit; 

storing the address; 

clearing the bit in the hit list; 

moving to the next position in the hit list; and 

repeating the above steps until the hit list is cleared. 

31. A method as claimed in any of claims 26 to 30 in which there is an 
initialisation phase in which: 

specified signal values are inputted; 

unspecified signal values are set to unknown; 
test templates are prepared defining the delay model for each logic
gate;

the input circuit is parsed to generate an equivalent circuit consisting
of 2-input logic gates; and

the 2-input logic gates are then configured.



32. A method as claimed in any of claims 26 to 31 in which a multi-valued logic is
applied and in which n bits are used to represent a signal value at any
instance in time with n being any arbitrarily chosen logic.

33. A method as claimed in claim 32 in which an 8-valued logic is used where 000
represents logic 0, 111 represents logic 1 and 001 to 110 represent arbitrarily
defined other signal states.

34. A method as claimed in claim 32 or 33 in which the sequence of values on a
logic gate is stored as a bit pattern forming a unique word in the associative
memory mechanism.

35. A method as claimed in any of claims 26 to 34 in which there is stored a
record of all values that a logic gate has acquired for the units of delay of the
longest delay in the circuit.

36. A parallel processor for logic event simulation (APPLES) comprising:- 

a main processor; 

an associative memory mechanism including a response resolver;

characterised in that the associative memory mechanism comprises:-

a plurality of separate associative sub-registers each for
the storage in word form of a history of gate input signals
for a specified type of logic gate; and

a plurality of separate additional sub-registers associated with each
associative sub-register whereby gate evaluations and tests can be
carried out in parallel on each associative sub-register.

37. A processor as claimed in claim 36, in which the additional sub-registers
comprise an input sub-register, a mask sub-register and a scan sub-register.

38. A processor as claimed in claim 37, in which the scan sub-registers are
connected to an output register.
INTERNATIONAL PRELIMINARY EXAMINATION REPORT
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I. Basis of the report 

1 . With regard to the elements of the international application (Replacement sheets which have been furnished to 
the receiving Office in response to an invitation under Article 14 are referred to in this report as "originally filed" 
and are not annexed to this report since they do not contain amendments (Rules 70. 16 and 70. 1 7)): 
Description, pages: 

1 -63 as originally filed 

Claims, No.: 

1-18 as received on 1 1 /07/2001 with letter of 05/07/200 1 

Drawings, sheets: 

1-7 as originally filed 

2. With regard to the language, all the elements marked above were available or furnished to this Authority in the 
language in which the international application was filed, unless otherwise indicated under this item. 

These elements were available or furnished to this Authority in the following language: , which is: 

□ the language of a translation furnished for the purposes of the international search (under Rule 23.1 (b)). 

□ the language of publication of the international application (under Rule 48.3(b)). 

□ the language of a translation furnished for the purposes of international preliminary examination (under Rule 
55.2 and/or 55.3). 

3. With regard to any nucleotide and/or amino acid sequence disclosed in the international application, the 
international preliminary examination was carried out on the basis of the sequence listing: 

□ contained in the international application in written form. 

□ filed together with the international application in computer readable form. 

□ furnished subsequently to this Authority in written form. 

□ furnished subsequently to this Authority in computer readable form. 

□ The statement that the subsequently furnished written sequence listing does not go beyond the disclosure in 
the international application as filed has been furnished. 

□ The statement that the information recorded in computer readable form is identical to the written sequence 
listing has been furnished. 

4. The amendments have resulted in the cancelation of: 

□ the description, pages: 

□ the claims, Nos.: 
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□ the drawings, sheets: 

5. □ This report has been established as if (some of) the amendments had not been made, since they have been 

considered to go beyond the disclosure as filed (Rule 70.2(c)): 

(Any replacement sheet containing such amendments must be referred to under item 1 and annexed to this 
report.) 

6. Additional observations, if necessary: 

III. Non-establishment of opinion with regard to novelty, inventive step and industrial applicability 

1 . The questions whether the claimed invention appears to be novel, to involve an inventive step (to be non- 
obvious), or to be industrially applicable have not been examined in respect of: 

□ the entire international application. 
☒ claims Nos. 16-18.

because: 

□ the said international application, or the said claims Nos. relate to the following subject matter which does 
not require an international preliminary examination (specify): 

☒ the description, claims or drawings (indicate particular elements below) or said claims Nos. 16-18 are so
unclear that no meaningful opinion could be formed (specify):
see separate sheet

□ the claims, or said claims Nos. are so inadequately supported by the description that no meaningful opinion 
could be formed. 

□ no international search report has been established for the said claims Nos. . 

2. A meaningful international preliminary examination cannot be carried out due to the failure of the nucleotide 
and/or amino acid sequence listing to comply with the standard provided for in Annex C of the Administrative 
Instructions: 

□ the written form has not been furnished or does not comply with the standard. 

□ the computer readable form has not been furnished or does not comply with the standard. 

V. Reasoned statement under Article 35(2) with regard to novelty, inventive step or industrial applicability; 
citations and explanations supporting such statement 

1. Statement 

Novelty (N)                     Yes: Claims 2, 4-8, 10
                                No:  Claims 3, 9, 11-15

Inventive step (IS)             Yes: Claims 2
                                No:  Claims 4-8, 10

Industrial applicability (IA)   Yes: Claims 1-15
                                No:  Claims

Form PCT/IPEA/409 (Boxes I-VIII, Sheet 2) (July 1998)



2. Citations and explanations 
see separate sheet 



VII. Certain defects in the international application 

The following defects in the form or contents of the international application have been noted: 
see separate sheet 
The examination is being carried out on the following application documents:
Text for the Contracting States: 

AT BE CH DE DK ES FI FR GB GR IT IE LI LU MC NL PT SE

Description, pages: 

1-63 as originally filed

Claims, No.:

1-18 as received on 11/07/2001 with letter of 05/07/2001

Drawings, sheets:

1-7 as originally filed



Re Item III 

Non-establishment of opinion with regard to novelty, inventive step and 
industrial applicability 

Claims 16-18 do not meet the requirements of Article 6 PCT in that the matter for which
protection is sought is not clearly defined. Because of the following lack of clarity, a
reasoned statement with regard to novelty and inventive step for these claims is not
possible:

Contrary to the requirements of Article 6 PCT, independent claim 16 does not
clearly define all the features necessary for the definition of the invention, i.e. for allowing
logic event simulation. Nor do its dependent claims.
Re Item V

Reasoned Statement under Article 35 (2) with regard to novelty, inventive step or 
industrial applicability; citations and explanations supporting such statement 

Reference is made to the following documents: 

D1: DALTON D: 'AN ASSOCIATIVE MEMORY APPROACH TO PARALLEL
LOGIC EVENT-DRIVEN SIMULATION' PROCEEDINGS OF THE ANNUAL
EUROPEAN CONFERENCE ON COMPUTER SYSTEMS AND SOFTWARE
ENGINEERING (COMPEURO), US, LOS ALAMITOS, IEEE COMP. SOC.
PRESS, vol. CONF. 6, 4 May 1992 (1992-05-04), pages 341-346,
XP000344219, ISBN: 0-8186-2760-3

D2: DALTON D: 'A special purpose hybrid SIMD processor for logic event
simulation' PROCEEDINGS OF THE SEVENTH EUROMICRO WORKSHOP
ON PARALLEL AND DISTRIBUTED PROCESSING. PDP'99,
PROCEEDINGS OF 7TH EUROMICRO WORKSHOP ON PARALLEL AND
DISTRIBUTED PROCESSING, FUNCHAL, PORTUGAL, 3-5 FEB. 1999,
pages 74-83, XP002158195, 1999, Los Alamitos, CA, USA, IEEE Comput.
Soc, USA, ISBN: 0-7695-0059-5

D3: Verteilte Systeme, Michael Weber, Spektrum Akademischer Verlag Berlin 
Heidelberg, 1998, p. 8-9 

The document D3 was not cited in the international search report. A copy of the 
document is appended hereto. 

1. The subject-matter of claim 1 is not new in the sense of Article 33(2) PCT.

Documents D1 and D2, which both describe earlier versions of the same APPLES
system as the present application, each disclose clearly and unambiguously all
features of claim 1.

2. Dependent claims 3-15 do not appear to contain any additional features which
meet the requirements of the PCT with respect to novelty and inventive step.

Regarding claims 5, 10: s. D1, p. 342, r. col., l. 27 - p. 343, r. col., l. 41

This passage shows a method in which each line signal to a target logic gate is stored
as a plurality of bits, each bit representing a delay of one time period, the
aggregate bits representing the inherent delay of each logic gate. If, as stated in
the description (p. 19, l. 32 - p. 20, l. 1), the time taken to transmit a message
between two logic gates can become [in comparison] considerable as the speed of
circuits increases, the skilled person can be assumed to realize this and to know
that the aggregate bits may then be interpreted as representing the inherent delay
of each logic gate together with that of the corresponding input line, i.e. as also
representing the delay between signal output and reception by the target logic
gate.

Regarding claim 8: s. D2, section 5.3, l. 1-7 in comb. with D2, p. 76, r. col., l. 9-12
and D2, p. 80, r. col., l. 1-5

Regarding claim 9: s. D2, p. 75, r. col., l. 6-14, p. 76, l. col., l. 4 - r. col., l. 13

Regarding claim 11: s. D2, section 3.2, l. 1-9

Regarding claims 12-14: s. D2, p. 77, l. col., l. 36 - p. 78, l. col., l. 13; D1, p. 342, item (v)

Regarding claim 15: s. D1, p. 342, r. col., l. 27 - p. 343, r. col., l. 41, D1, fig. 6, D2, fig. 2

It is clear that in D1, fig. 6 and D2, fig. 2, the associative array 1b is intended to store,
for each logic gate, a record of all values that the logic gate has acquired, including
for the logic gate with the longest delay in the circuit.

Regarding claim 3: s. D2, section 5.3, l. 1-7

Regarding claims 4, 6, 7:
It also appears that the subject matter of these claims is not inventive, for the following
reasons:

(i) The inclusion of scan registers in the system of figure 6 of D1, as shown in figure
10, item 6a, of the application, is disclosed in D2, section 5.3, l. 1-7.

(ii) From D2, sections 5.2-5.4, the skilled person knows that

"the hit density of the [hit-] list has been found to be unaffected by the circuit size 
[i.e. to be about 1%, s. table 3] while the scanning and updating processes is 
amenable to parallelisation. When the access time of the fan-out list is 
commensurate to that of the scan clock [which evidently is possible, see section 
5.3, l. 18-25] linear speedup relative to the number of registers is possible.
Consequently, by employing more scan registers [i.e. as many as average hits per
evaluation, s. table 3 and table 4, first row] it is possible to push the scan rate
towards that of the fan-out memory access rate providing clashes are minimised.
A low clash rate is achievable through a fast memory access time and the low
activity rate of logic circuits [the former being possible and the latter given, see above]".

The skilled person seeking further acceleration of logic event simulation will
apply this knowledge to D2, figure 4, and realize

(a) that all steps in each gate-type-specific execution of a for-loop are
comparable with regard to the cycles needed (evidently, for the last step, the
fan-out gate updates, a for-loop needs to perform only that fraction of the
fan-out gate updates of the total evaluation which corresponds to the particular
gate type considered in that for-loop), and

(b) that all steps within a for-loop can be performed in parallel with any step within
another for-loop.

Thus, with this knowledge, transforming the sequential execution of the for-loops into
concurrent execution (a well-known parallelisation principle for achieving speed-up)
appears feasible to the skilled person, and also appears to be a good, or even the only,
way to gain further significant speed-up. (In addition, the use of several arrays is
itself proposed as an outlook in D2, section 5.4, l. 23-24, which in itself leads the
skilled person to search for ways of parallelisation.)

Therefore, it seems obvious to parallelise the for-loops and to execute each for-loop
on a different executing unit, which in the present case is a set of associative
arrays or register banks as shown in D1, fig. 6. Because results, here produced
by the respective multiple response resolver and fan-out memory and read in via
the respective input value bank, must be exchanged between the executing units,
a means corresponding to item 42 in fig. 10 of the application is evidently
necessary.

Furthermore, the skilled person is familiar with grid-based topologies and hypercube
structures that enable efficient communication between executing units in parallel
embodiments (s. D3, section 1.2.4).

3. Dependent claim 2, referring to the description, p. 13, l. 14-28 and p. 20, l. 4-24,
appears to contain additional features which meet the requirements of the PCT with
respect to novelty and inventive step, and to solve the problem of holding a number of
delay units where the corresponding delay word width exceeds the width of the
associative register 1b.

Re Item VII 

Certain defects in the international application 

Contrary to the requirements of Rule 5.1(a)(ii) PCT, the relevant background art
disclosed in documents D1 and D2 is not mentioned in the description, nor are
these documents identified therein.
"Logic Event Simulation"



Introduction 

The present invention is directed towards a parallel processing method of logic
simulation comprising representing signals on a line over a time period as a bit
sequence, evaluating the output of any logic gate, including an evaluation of any
inherent delay, by a comparison between the bit sequences of its inputs and a
predetermined series of bit patterns, and in which those logic gates whose outputs
have changed over the time period are identified during the evaluation of the gate
outputs as real gate changes and only those real gate changes are propagated to
fan-out gates, and in which the control of the method is carried out in an associative
memory mechanism which stores in word form a history of gate input signals by
compiling a hit list register of logic gate state changes and using a multiple response
resolver, forming part of the associative memory mechanism, which generates an
address for each hit and then scans and transfers the results on the hit list to an
output register for subsequent use. The output register may contain the final result
of the simulation or may be a list of outputs to be used for subsequent fan-out to
other gates. Further, the invention is directed towards providing a parallel processor
for logic event simulation (APPLES).

Logic simulation plays an important role in the design and validation of VLSI circuits.
As circuits increase in size and complexity, there is an ever-increasing requirement
to accelerate the processing speed of this design tool. Parallel processing has been
perceived in industry as the best method to achieve this goal, and numerous parallel
processing systems have been developed. Unfortunately, large speedup figures
have eluded these approaches. Higher speedup figures have been achieved, but
only by compromising the accuracy of the gate delay model employed in these
systems. The principal barrier is a large communication overhead, arising from the
basic passing of values between processors, elaborate measures to avoid or recover
from deadlock, and load-balancing techniques.



The ever-expanding size of VLSI (Very Large Scale Integration) circuits has further 
emphasised the need for a fast and accurate means of simulating digital circuits. A 
compromise between model accuracy and computational feasibility is found in logic
simulation. In this simulation paradigm, signal values are discrete and may acquire 
in the simplest case logic values 0 and 1. More complex transient state signal 
values are modelled using up to 9-state logic. Logic gates can be modelled as ideal 
components with zero switching time or more realistically as electronic components 
with finite delay and switching characteristics such as inertial, pure or ambiguous 
delays. 

Due to the enormity of the computational effort for large circuits, the application of
parallel processing to this problem has been explored. Unfortunately, large
speedups have been elusive for most systems and approaches.

Sequential (uni-processor) logic simulation can be divided into two broad
categories: compiled-code and event-driven simulation (Breuer et al: Diagnosis and
Reliable Design of Digital Systems. Computer Science Press, New York (1976)).
These techniques can be employed in a parallel environment by partitioning the
circuit amongst processors. In compiled-code simulation, all gates are evaluated at
all time steps, even if they are not active. The circuit has to be levellised and only
unit or zero delay models can be employed. Sequential circuits also pose difficulties
for this type of simulation. A compiled-code mechanism has been applied to several
generations of specialised parallel hardware accelerators designed by IBM: the
Logic Simulation Machine LSM (Howard et al: Introduction to the IBM Los Gatos
Simulation Machine. Proc IEEE Int. Conf. Computer Design: VLSI in Computers
(Oct 1983) 580-583), the Yorktown Simulation Engine (Pfister: The Yorktown
Simulation Engine: Introduction. 19th ACM/IEEE Design Automation Conf. (June
1982), 51-54) and the Engineering Verification Engine EVE (Dunn: IBM's
Engineering Design System Support for VLSI Design and Verification. IEEE
Design and Test of Computers (February 1984) 30-40), with performance figures as
high as 2.2 billion gate evaluations/sec reported. Agrawal et al (Logic Simulation
and Parallel Processing. Intl Conf on Computer Aided Design (1990)) have analysed
the activity of several circuits, and their results indicate that at any time
instant circuit activity (i.e. the proportion of gates whose outputs are in transition) is
typically in the range 1% to 0.1%. Therefore, the effective number of gate evaluations
of these engines is likely to be smaller by a factor of a hundred or more. Speedup values
ranging from 6 to 13 for various compiled-code benchmark circuits have been
observed on the shared memory MIMD Encore Multimax multiprocessor by Soule
and Blank (Parallel Logic Simulation on General Purpose Machines. Proc Design
Automation Conf (June 1988), 166-171). A SIMD (array) version was investigated
by Kravitz (Mueller-Thuns et al: Benchmarking Parallel Processing Platforms: An
Application Perspective. IEEE Trans on Parallel and Distributed Systems, 4 No. 8
(Aug 1993)) with similar results.

The intrinsic unit delay model of compiled code simulators is overly simplistic for 
many applications. 

Some delay-model limitations of compiled-code simulation have been eliminated in
parallel event-driven techniques. These parallel algorithms are largely composed of
two phases: a gate evaluation phase and an event-scheduling phase. The gate
evaluation phase identifies gates that are changing, and the scheduling phase puts
the gates affected by these changes (the fan-out gates) into a time-ordered linked
schedule list, determined by the current time and the delays of the active gates.
Soule and Blank (Parallel Logic Simulation on General Purpose Machines. Proc
Design Automation Conf (June 1988), 166-171) and Mueller-Thuns et al
(Benchmarking Parallel Processing Platforms: An Application Perspective. IEEE
Trans on Parallel and Distributed Systems, 4 No. 8 (Aug 1993)) have investigated both
shared and distributed memory synchronous event MIMD architectures. Again,
overall performance has been disappointing: several benchmarks
executed on an 8-processor Encore Multimax and an 8-processor iPSC Hypercube
gave speedup values of only 3 to 5.

Asynchronous event simulation permits limited processor autonomy. Causality
constraints require occasional synchronisation between processors and rolling back
of events, and deadlock between processors must be resolved. Chandy and Misra
(Asynchronous Distributed Simulation via a Sequence of Parallel Computations. Comm
ACM 24(ii) (April 1981), 198-206) and Bryant (Simulation of Packet Communications
Architecture Computer Systems. Tech report MIT-LCS-TR-188, MIT, Cambridge
(1977)) have developed deadlock avoidance algorithms, while Briner (Parallel Mixed
Level Simulation of Digital Circuits Virtual Time. Ph.D. thesis, Dept of El. Eng., Duke
University (1990)) and Jefferson (Virtual Time. ACM Trans Programming Languages
and Systems (July 1985) 404-425) have explored algorithms based on deadlock recovery.
The best speedup figures for shared and distributed memory
asynchronous MIMD systems were 8.5 for a 14-processor system and 20 for a
32-processor BBN system.

Optimising strategies such as load balancing, circuit partitioning and distributed
queues are necessary to realise the best speedup figures. Unfortunately, these
mechanisms themselves contribute large communication overheads for even
modest-sized parallel systems. Furthermore, the gate evaluation process, despite
its small granularity, incurs between 10 and 250 machine cycles per gate evaluation.

Statements of Invention 

The invention comprises a method and a processor, the Associative Parallel
Processor for Logic Event Simulation, referred to in this specification as APPLES,
which is specifically designed for parallel discrete event logic simulation and for
carrying out such a parallel processing method. In summary, the invention performs
gate evaluations in memory and replaces interprocessor communication with a scan
technique. Further, the scan mechanism is arranged so as to facilitate parallelisation,
and a wide variety of delay models may be used.

Essentially, there is therefore provided a parallel processing method of logic
simulation comprising representing signals on a line over a time period as a bit
sequence, evaluating the output of any logic gate, including an evaluation of any
inherent delay, by a comparison between the bit sequences of its inputs and a
predetermined series of bit patterns, and in which those logic gates whose outputs
have changed over the time period are identified during the evaluation of the gate
outputs as real gate changes and only those real gate changes are propagated to
fan-out gates. The control of the method is carried out in an associative memory
mechanism which stores in word form a history of gate input signals by compiling a
hit list register of logic gate state changes and using a multiple response resolver,
forming part of the associative memory mechanism, which generates an address for
each hit and then scans and transfers the results on the hit list to an output register
for subsequent use.

One of the core features of the invention is the segmentation or division of at least
one of the registers or hit lists into smaller registers or hit lists to reduce
computational time. The other feature of considerable importance is the handling of
line signal propagation by modelling signal delays. Finally, the method according to
the invention allows simulation to be carried out over arbitrarily chosen time periods.

Either the associative register is divided into separate, smaller associative
sub-registers, one type of logic gate being allocated to each associative sub-register,
each of which has corresponding sub-registers connected thereto, whereby gate
evaluations and tests are carried out in parallel on each associative sub-register.
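As a rough software analogy of per-gate-type associative sub-registers (a minimal sketch, not the APPLES hardware), the code below groups words by their gate-type tag and searches each group independently. The function names, the one-word-per-entry simplification and the `(pattern, mask)` search format are assumptions made for illustration; because the groups are disjoint, the searches can be issued concurrently.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_by_type(tags: list[str]) -> dict[str, list[int]]:
    """Group word indices by gate type: one (hypothetical) sub-register per type."""
    sub_registers: dict[str, list[int]] = {}
    for index, tag in enumerate(tags):
        sub_registers.setdefault(tag, []).append(index)
    return sub_registers


def evaluate_by_type(tags: list[str], words: list[int],
                     searches: dict[str, tuple[int, int]]) -> list[int]:
    """Search every sub-register with its own (pattern, mask) and merge the hits.

    A word matches when all bits selected by `mask` equal those of `pattern`.
    Threads stand in for the parallel sub-register hardware; they show that
    the searches are independent, not a real speed-up.
    """
    sub_registers = partition_by_type(tags)
    hit_list = [0] * len(words)

    def search(gate_type: str) -> list[int]:
        pattern, mask = searches[gate_type]
        return [i for i in sub_registers[gate_type]
                if ((words[i] ^ pattern) & mask) == 0]

    types = [t for t in sub_registers if t in searches]
    with ThreadPoolExecutor() as pool:
        for hits in pool.map(search, types):
            for i in hits:
                hit_list[i] = 1
    return hit_list
```

For example, with tags `["AND", "OR", "AND", "OR"]`, words `[0b11, 0b01, 0b10, 0b00]` and searches `{"AND": (0b11, 0b11), "OR": (0b01, 0b01)}`, the merged hit list is `[1, 1, 0, 0]`.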

Alternatively, particularly where the circuit being simulated is not too large, a
satisfactory simulation can be achieved by segmenting the hit list into a plurality of
separate, smaller hit lists, each connected to a separate scan register. In this case,
each scan register is operated in parallel to transfer the results to the output register.
This overcomes a particular computational problem in these parallel processors and
speeds up the whole simulation considerably.

Further, the invention provides a parallel processor for logic event simulation
(APPLES) which essentially has an associative memory mechanism comprising
a plurality of separate associative sub-registers, each for the storage in word form of a
history of gate input signals for a specified type of logic gate. Further, there are a
number of separate additional sub-registers associated with each associative
sub-register, whereby gate evaluations and tests can be carried out in parallel on each
associative sub-register.

In the method according to the invention, each associative sub-register is used to
form a hit list connected to a corresponding separate scan register.

Ideally, when there are a number of sub-registers and the number of the one type of 
logic gate exceeds a predetermined number, more than one sub-register is used. 



WO 01/01298 PCT/IE00/00083

Ideally, the scan registers are controlled by exception logic using an OR gate,
whereby the scan of each register is terminated when the OR gate changes state,
indicating that there are no further matches. The predetermined number will be
determined by the computational load.
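The segmented hit list with one scan register per segment, clocked together and each stopped by its own OR-gate exception logic, can be sketched in software as follows. This is a minimal sketch under stated assumptions: equal-length segments, one hit-list position examined per register per clock, and the name `segmented_scan` are all hypothetical.

```python
def segmented_scan(hit_list: list[int], num_segments: int) -> tuple[list[int], int]:
    """Scan a hit list split into equal segments, one scan register per segment.

    All scan registers advance together on a shared clock; a segment stops as
    soon as the OR of its remaining bits is 0 (no further matches).
    Returns the hit positions transferred to the output register and the
    number of clock cycles used.
    """
    n = len(hit_list)
    seg_len = -(-n // num_segments)  # ceiling division
    bits = list(hit_list)
    # Each scan register starts at the beginning of its own segment.
    pointers = [s * seg_len for s in range(num_segments)]
    ends = [min((s + 1) * seg_len, n) for s in range(num_segments)]
    output_register: list[int] = []
    cycles = 0
    while True:
        # Exception logic: OR of the bits still set in each segment.
        active = [s for s in range(num_segments) if any(bits[pointers[s]:ends[s]])]
        if not active:
            break
        cycles += 1
        for s in active:
            p = pointers[s]
            if bits[p]:
                output_register.append(p)
                bits[p] = 0  # clear the bit once it has been transferred
            pointers[s] = p + 1
    return sorted(output_register), cycles
```

With a 16-bit hit list holding hits at positions 1 and 14, a single scan register needs 15 clocks, whereas four segments finish in 3, which illustrates why segmentation speeds up the scan when hits are sparse.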


The scan can be carried out in many ways, but one of the best is by counting
sequentially through the hit list. When this is done, the following steps are
generally performed:

checking if the bit is set, indicating a hit;

if a hit, determining the address affected by that hit;

storing the address;

clearing the bit in the hit list;

moving to the next position in the hit list; and

repeating the above steps until the hit list is cleared.

Where fan-out occurs, more than one address will subsequently be affected.
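The scan steps listed above map directly onto a short loop. In this sketch the hit list is a list of bits and the fan-out memory is a hypothetical dictionary from a gate position to the addresses it drives; both representations are assumptions for illustration. Because one hit can fan out to several addresses, each hit may contribute more than one entry to the output register.

```python
def scan_hit_list(hit_list: list[int], fan_out: dict[int, list[int]]) -> list[int]:
    """Sequentially scan a hit list and collect the fan-out addresses.

    hit_list: one bit per gate; 1 marks a gate whose output changed (a hit).
    fan_out:  hypothetical fan-out memory mapping a hit position to the
              addresses of the gate inputs it drives.
    Returns the addresses written to the output register, in scan order.
    """
    bits = list(hit_list)  # work on a copy so the caller's list is untouched
    output_register: list[int] = []
    position = 0
    while any(bits):  # repeat until the hit list is cleared
        if bits[position]:  # is the bit set, indicating a hit?
            # determine the address(es) affected by this hit and store them
            output_register.extend(fan_out.get(position, []))
            bits[position] = 0  # clear the bit in the hit list
        position += 1  # move to the next position in the hit list
    return output_register
```

For instance, `scan_hit_list([0, 1, 0, 1], {1: [10, 11], 3: [12]})` returns `[10, 11, 12]`.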






In one particular embodiment of the invention, there is provided such a parallel
processing method of logic simulation in which each line signal to a target logic gate
is stored as a plurality of bits, each representing a delay of one time period, the
aggregate bits representing the delay between signal output and reception by the
target logic gate, and in which the inherent delay of each logic gate is represented in
the same manner. The time period is arbitrarily chosen and will often be of the order
of 1 nanosecond or less. The fact that the time period can be arbitrarily chosen is of
immense importance, since it makes it possible to simulate a circuit for a plurality of
different time periods. Additionally, the effect of the delay inherent in the transfer of
line signals between logic gates is becoming more important as the response times of
circuit components decrease.
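One way to picture "a plurality of bits, each representing a delay of one time period" is a fixed-length shift register: a value driven onto the line reaches the target gate only after it has passed through every bit. The sketch below assumes that reading; the class name `DelayLine` and its interface are hypothetical, and a gate's inherent delay could be modelled by the same structure.

```python
from collections import deque


class DelayLine:
    """A line (or gate) delay of `delay` time periods held as `delay` bits.

    Each stored bit stands for one time period; together they represent the
    delay between a signal being output and its reception by the target gate.
    """

    def __init__(self, delay: int, initial: int = 0) -> None:
        if delay < 1:
            raise ValueError("delay must be at least one time period")
        self.bits = deque([initial] * delay, maxlen=delay)

    def step(self, value: int) -> int:
        """Advance one time period: return the bit now arriving at the target,
        then shift the newly driven value in."""
        arriving = self.bits[0]
        self.bits.append(value)  # maxlen makes the deque drop bits[0]
        return arriving
```

Driving a single 1 into a 3-period line, `[line.step(v) for v in [1, 0, 0, 0, 0]]` yields `[0, 0, 0, 1, 0]`: the pulse arrives three time periods later.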




In this latter embodiment, each delay is stored as a delay word in an associative
memory forming part of the associative memory mechanism, in which:-

the length of the delay word is ascertained; and

if the delay word width exceeds the associative register word width:-

the number of integer multiples of the register word width contained within the
delay word is calculated as a gate state;

the gate state is stored in a further state register;

the remainder from the calculation is stored in the associative register with
those delay words whose widths did not exceed the associative register word
width; and

on the count of the associative register commencing:-

the state register is consulted for the delay word entered in the state register
and the remainder is ignored for this count of the associative register;

at the end of the count of the associative register, the state register is
updated; and

the count continues until the remainder represents the count still required.
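The split of an over-wide delay into a state-register count and a remainder can be sketched as below. This assumes delays are measured in whole time periods and that "updating" the state register means decrementing it once per completed register count; both function names are hypothetical. The second function only checks that the scheme accounts for the full delay.

```python
def split_delay(delay: int, register_width: int) -> tuple[int, int]:
    """Split a delay (in time periods) into (gate_state, remainder).

    gate_state: number of whole register widths, kept in the state register.
    remainder:  the part kept in the associative register alongside delays
                that already fit within the register word width.
    """
    if delay <= register_width:
        return 0, delay
    return divmod(delay, register_width)


def periods_until_fire(delay: int, register_width: int) -> int:
    """Count the time periods consumed under the state-register scheme."""
    state, remainder = split_delay(delay, register_width)
    elapsed = 0
    while state > 0:
        elapsed += register_width  # one full count; the remainder is ignored
        state -= 1                 # state register updated at the end of the count
    elapsed += remainder           # the remainder is now the count still required
    return elapsed
```

For a 32-bit associative register, a 70-period delay becomes a gate state of 2 with a remainder of 6, and the counts still add up to 70 periods.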

For carrying out the invention, an initialisation phase is carried out in which
specified signal values are input, unspecified signal values are set to unknown,
test templates defining the delay model for each logic gate are prepared, the input
circuit is parsed to generate an equivalent circuit consisting of 2-input logic gates, and
the 2-input logic gates are then configured.
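The parsing step that produces an equivalent circuit of 2-input gates can be illustrated for multi-input gates. This sketch assumes a simple chained decomposition: the associative base operator is applied to all but the last input, and an inverting type (NAND, NOR, XNOR) is applied only at the final stage, for example NAND(a, b, c) = NAND(AND(a, b), c). The tuple format, the `BASE_OP` table and the generated net names are hypothetical, and inverters are not handled here.

```python
# Base (non-inverting) operator used for the intermediate stages of each type.
BASE_OP = {"AND": "AND", "OR": "OR", "XOR": "XOR",
           "NAND": "AND", "NOR": "OR", "XNOR": "XOR"}


def to_two_input(gate_type: str, inputs: list[str],
                 output: str) -> list[tuple[str, str, str, str]]:
    """Return (type, input_a, input_b, output) tuples for an equivalent
    chain of 2-input gates; intermediate nets are named f"{output}_t<k>"."""
    if len(inputs) < 2:
        raise ValueError("expected at least two inputs")
    gates = []
    acc = inputs[0]
    for i, nxt in enumerate(inputs[1:], start=1):
        last = i == len(inputs) - 1
        g_type = gate_type if last else BASE_OP[gate_type]
        out = output if last else f"{output}_t{i - 1}"
        gates.append((g_type, acc, nxt, out))
        acc = out
    return gates
```

For example, a 3-input NAND driving net `y` becomes `[("AND", "a", "b", "y_t0"), ("NAND", "y_t0", "c", "y")]`.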
With the present invention, multi-valued logic may be applied; in this situation, n
bits are used to represent a signal value at any instant in time, with n being
arbitrarily chosen. A particularly suitable choice is an 8-valued logic in which 000
represents logic 0, 111 represents logic 1, and 001 to 110 represent arbitrarily defined
other signal states.

One of the features of the invention is that the sequence of values on a logic gate is
stored as a bit pattern forming a unique word in the associative memory mechanism;
by doing this, it is possible to store a record of all values that a logic gate has
acquired for the units of delay of the longest delay in the circuit.
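Combining the two paragraphs above: with n bits per value, a word that records one value per time unit for the longest delay needs n times that delay in bits. The sketch below assumes the 3-bit, 8-valued encoding described above and assumes the newest value is inserted at the most significant end as the word is shifted right; the function names are hypothetical.

```python
BITS_PER_VALUE = 3            # 8-valued logic: 3 bits per signal value
LOGIC_0, LOGIC_1 = 0b000, 0b111  # 0b001..0b110 are other, user-defined states


def history_word_width(bits_per_value: int, longest_delay: int) -> int:
    """Bits needed to keep one value per time unit for the longest delay."""
    return bits_per_value * longest_delay


def push_value(word: int, value: int, bits_per_value: int, longest_delay: int) -> int:
    """Shift the history right by one value and insert `value` as the newest
    (most significant) entry; the oldest value falls off the low end."""
    width = history_word_width(bits_per_value, longest_delay)
    if not 0 <= value < (1 << bits_per_value):
        raise ValueError("value does not fit in bits_per_value bits")
    return (word >> bits_per_value) | (value << (width - bits_per_value))
```

With a longest delay of 4 time units the word is 12 bits wide; pushing logic 1 and then logic 0 into an empty word gives `0b000_111_000_000`.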

Detailed Description of the Invention

The invention will be more clearly understood from the following description of
embodiments thereof, given by way of example only, with reference to the
accompanying drawings, in which:-



Fig. 1 illustrates the functions of the blocks of the APPLES processor;

Fig. 2 illustrates the inertial delay mechanism in the APPLES system;

Fig. 3 is an illustration of a simulation cycle;

Fig. 4 is a test search pattern;

Fig. 5 is an illustration of the logical combination mechanism according to
the invention;

Fig. 6 illustrates the components active during a gate evaluation phase;

Fig. 7 shows bit patterns for an ambiguous delay model and hazard detection;

Fig. 8 is an outline of an alternative arrangement of processors according to
the invention;

Fig. 9 illustrates the structure of one processor in more detail; and

Fig. 10 is a view similar to Fig. 1 of the alternative construction of processor.



The essential elemental tasks for parallel logic simulation are:

1. Gate evaluation.
2. Delay model implementation.
3. Updating fan-out gates.



The design framework for a specific parallel logic simulation architecture originated
by identifying the essential elemental simulation operations which can be performed
in parallel, and by minimising the tasks that support these operations and which are
totally intrinsic to the parallel system.

Activities such as event scheduling and load balancing are perceived as
implementation issues which need not necessarily be incorporated into a new
design. An important additional criterion is that the design must execute directly in
hardware as many parallel tasks as possible, as fast as possible, but without limiting
the type of delay model.

The present invention, taking account of the above objectives, incorporates several
special associative memory blocks and hardware in the APPLES architecture.

The gate evaluation/delay model implementation and update/fan-out processes will
be explained with reference to the APPLES architecture shown in Fig. 1.



Referring to Fig. 1, the functional blocks of the APPLES processor are shown. The
blocks pertinent to gate evaluation are associative array 1a 1, input-value-register
bank 2, associative array 1b, test-result-register bank 4, group-result register bank
5 and the group-test hit list 6. The group-test hit list in turn feeds a multiple
response resolver 7, which in turn feeds a fan-out memory 8 to an address register
9 connected to the input-value-register bank 2. The associative array 1 has an
associative mask register 1a and input register 1a, while the associative array 1b
has a mask register 1b and an input register 1b. Similarly, the test-result register
bank 4 has a result-activator register 14 and the group-result register bank 5 has a
mask register 15 and an input register 16. Finally, an input-value register bank 17
is provided. As well as the associative arrays, the group-result register bank has
parallel search facilities. Regardless of the number of words they contain, these
structures can be searched in parallel in constant time. Furthermore, the words in the
input-value-register bank 17 and associative array 1b can be shifted right in parallel
while resident in memory.

A gate can be evaluated once its input wire values are known. In conventional
uni-processor and parallel systems, these values are stored in memory and accessed by
the processor(s) when the gate is activated. In APPLES, gate signal values are
stored in associative memory words. The succession of signal values that have
appeared on a particular wire over a period of time is stored in a given associative
memory word in a time-ordered sequence. For instance, a binary value model could
store in a 32-bit word the history of wire values that have appeared over the last 32
time intervals. Gate evaluation proceeds by searching in parallel for appropriate
signal values in associative memory. Portions of the words which are irrelevant (e.g.
only the 4 most recent bits are relevant for a 4-unit gate delay model) are masked out
of the search by the memory's input and mask register combination. For a given gate
type (e.g. AND, OR) and gate delay model, there are requirements on the structure of
the input signals to effect an output change. Each pattern search in associative
memory detects those signal values that have a certain attribute of the necessary
structure (e.g. those signals which have gone high within the last 3 time units).
Those wires that have all the attributes indicate active gates. The wire values are
stored in a memory block designated associative array 1b (word-line-register bank).

Only those gate types relevant to the applied search patterns are selected. This is
accomplished by tagging a gate type to each word. These tags are held in
associative array 1a. A specific gate type is activated by a parallel search of the
designated tag in associative array 1a.
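The masked parallel search described above can be mimicked with bitwise operations: a word matches when every bit selected by the mask agrees with the input (pattern) register, and a separate tag search restricts the result to one gate type. This is a minimal sketch; the function names, the convention that a mask bit of 1 means "compare this position", and the placement of the newest value in the most significant bit are assumptions.

```python
def associative_search(words: list[int], pattern: int, mask: int) -> list[int]:
    """Return one tag bit per word: 1 where the masked bits equal the pattern.

    A 1 in `mask` marks a bit position taking part in the comparison; 0 marks
    a position masked out of the search. Hardware compares all words at once;
    the list comprehension stands in for that parallel compare.
    """
    return [int(((w ^ pattern) & mask) == 0) for w in words]


def select_gates(tags: list[str], gate_type: str, words: list[int],
                 pattern: int, mask: int) -> list[int]:
    """Combine a gate-type tag search (array 1a) with a wire-value search (array 1b)."""
    type_hits = [int(t == gate_type) for t in tags]
    wire_hits = associative_search(words, pattern, mask)
    return [a & b for a, b in zip(type_hits, wire_hits)]
```

For 8-bit history words with the newest value in the top bit, a 4-unit test ("the 4 most recent values are all 1") uses `mask = 0b11110000` and `pattern = 0b11110000`; the words `0b11110000` and `0b11111010` match, while `0b01110000` does not.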
This simple evaluation mechanism implies that the wires must be identified by the
type of gate into which they flow, since different gate types have different input wire
sequences that activate them. Gates of a certain type are selected by a parallel
search on gate type identifiers in associative array 1a.


Each signal attribute corresponds to a bit pattern search in memory. Since several
attributes are normally required for an activated gate, the results of several pattern
searches must be recorded. These searches can be considered as tests on words.

The result of a test is either successful or not. This can be recorded as a single bit in a
corresponding word in another register, held in a register bank termed the
test-result-register bank. Since each gate is assumed to have two inputs (inverters and
multiple-input gates are translated into their 2-input gate circuit equivalents), tests are
combined on pairs of words in this bank. This combination mechanism is specific to
a delay model, is defined by the result-activator register and consists of simple
AND or OR operations between bits in the word pairs.

The results of combining each word pair, the final stage of the gate evaluation process, are stored as a single word in another associative array, the group-result register bank 5. Active gates will have a unique bit pattern in this bank and can be identified by a parallel search for this bit pattern. Successful candidates of this search set their bit in the 1-bit column register, the group-test hit list.

The bits in each column position of every gate pair in the test-result register bank 4 are combined in accordance with the logic operators defined in the result-activator register. The bits in each column are combined sequentially in time in order to reduce the number of output lines from the test-result register bank 4. Thus, only one output line is required for each gate pair in the test-result register bank, instead of one wire for each column position.
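The pairwise combination and the final group-result search can be modelled as below. This sketch assumes test results are stored as lists of 0/1 flags (index i holding the outcome of test T(i+1)), that the two words of a gate occupy adjacent positions 2k and 2k+1, and that the result-activator register is a list in which 0 selects OR (the ANY/NONE conditions) and 1 selects AND (the BOTH condition). The column-by-column loop mirrors the sequential combination described above; all names are hypothetical.

```python
from typing import List

Bits = List[int]


def combine_pair(res_a: Bits, res_b: Bits, activator: Bits) -> Bits:
    """Combine one word pair column by column: 0 -> OR, 1 -> AND."""
    group = []
    for a, b, act in zip(res_a, res_b, activator):
        group.append(a & b if act else a | b)
    return group


def group_results(test_results: List[Bits], activator: Bits) -> List[Bits]:
    """Build the group-result bank: one word per gate from words 2k and 2k+1."""
    return [combine_pair(test_results[k], test_results[k + 1], activator)
            for k in range(0, len(test_results), 2)]


def hit_list(groups: List[Bits], pattern: Bits, care: Bits) -> Bits:
    """Search the group-result bank; 1 marks a gate whose cared-for bits match."""
    return [int(all(g == p for g, p, c in zip(grp, pattern, care) if c))
            for grp in groups]


# Three gates, two tests each (T1, T2), both columns combined with OR.
results = [[1, 0], [0, 0],   # gate 0: one input passes T1, none pass T2
           [1, 1], [0, 0],   # gate 1: an input passes both tests
           [0, 0], [0, 0]]   # gate 2: no input passes anything
groups = group_results(results, activator=[0, 0])
print(groups)                                          # [[1, 0], [1, 1], [0, 0]]
print(hit_list(groups, pattern=[1, 0], care=[1, 1]))  # [1, 0, 0]
```

Only gate 0 sets its hit-list bit, because the pattern '10' encodes "ANY input passed T1 and NO input passed T2".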


The results of combining gate pairs in the test-result register bank 4 are written column by column into the group-result register bank 5. Only one column is written (in parallel across all gate pairs) at each clock edge. This implies that only one input wire to the group-result register bank 5 is required per gate pair in the test-result register bank, which reduces the number of connections from the test-result register bank to the group-result register bank.

The scan registers are independent insofar as they can be decremented or incremented while other scan registers are disabled; however, they are clocked in unison by one clock signal.

The optimum number of scan registers is inversely proportional to the probability of a hit being detected in the hit list.

It is essential that an OR operation over all bits in the hit list is computed on one edge of a clock period, to determine when all hit bits are clear, and that on the opposite edge of the same clock cycle any scan register that is given access to its fan-out list is permitted to clear the hit bit that it has detected. Access is controlled by a wait-semaphore system to ensure that only one access at a time is made to each single-ported memory.

An alternative system consists of a multi-ported fan-out memory made up of several memory banks, each of which can be accessed simultaneously. Each memory bank in the system has its own semaphore control mechanism.

An alternative strategy has a hit bit enable the inputs of its fan-out list in the input-value register bank. The enable connections from the hit list to the appropriate elements in the input-value register bank are made prior to the commencement of the simulation and are determined by the connectivity between the gates in the circuit being simulated. These connections can be made by a dynamically configured device such as an FPGA (Field Programmable Gate Array), which can physically route each hit-list element to its fan-out inputs. In this way all active fan-out elements so connected are enabled simultaneously and updated with the same logic value in parallel.

The control core consists of a synchronised, self-regulated sequence of events, identified in one example Verilog implementation as e0, e1, e2, etc. An event corresponds to the completion of a major task. Self-regulation means that no software controls the sequence of events, although there may be software external to the processor which solicits information concerning the status of the processor. Furthermore, it implies that there is no microprogramming involved in the design. This eliminates the need for a microprogrammed unit and increases the speed of processing.

In the fan-out update activity (controlled, for example, by e20), it is essential that this activity is terminated by the event signalling that the multiple response resolver 7 has no more hits to detect. Alternatively, the activity could be terminated by the event that the entire hit list has been scanned; however, detecting that no more hits exist can terminate the fan-out update procedure early and leads to a faster execution time for this procedure.

Some logic entities may have delays which exceed the time frame representable in a word of associative array 1b. Larger delays can be modelled by associating a state with a gate type. In this case a gate and its state are defined in associative array 1a. Tests are performed on associative array 1b and, when a gate with a given state passes some input value criterion, the gate state is amended in associative array 1a, in addition to the fan-out components of the gate possibly being affected. This new state may also cause a new output value to be ascribed to the fan-out list of the gate. The tests that are applied are determined by the gate type and state. In this mechanism the fan-out list of a gate includes the normal fan-out inputs and the address in associative array 1a of the gate itself.

In order to determine whether the state alone, or the state and the fan-out gates, are to be updated, the state (a binary value) can serve as an offset into the gate's fan-out update data files. The state is added to the start location of each of a gate's data files, and this enables the gate's normal fan-out list to be bypassed or not.

The interconnect between logic entities being simulated can be modelled using a large delay model, described below. Furthermore, single wires can be modelled by one word instead of two in associative array 1a, associative array 1b and the test-result register bank 4. Branch points are modelled as separate wires, permitting different branch points to have different delay characteristics.
An efficient implementation uses single-word versions of associative array 1a, associative array 1b and the test-result register bank.

The APPLES gate evaluation mechanism selects gates of a certain type, applies a sequence of bit-pattern searches (tests) to them, and ascertains the active gates by recording the result of each pattern search and determining those that have fulfilled all the necessary tests. This mechanism executes gate evaluation in constant time, since the parallel search is independent of the number of words. This is an effective linear speedup for the evaluation activity. It also facilitates different delay models, since a delay model can be defined by a set of search patterns. Further discussion of this is given below.

Active gates set their bits in the column hit list. A multiple response resolver scans through this list. The multiple response resolver can be a single counter which inspects the entire list from top to bottom, stops when it encounters a set bit, and then uses its current value as a vector to the fan-out list of the identified active gate. This list holds the addresses of the fan-out gate inputs in an input-value register bank. The new logic values of the active gates are written into the appropriate words of this bank.



The counter then clears the bit before decrementing through the remainder of the list and repeating this process. All hit bits are ORed together so that, when all bits are clear, this can be detected immediately and no further scanning need be done.
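The single-counter resolver can be sketched as below. Assumptions: the hit list is a Python list of 0/1 flags, the fan-out table and the new output values are dictionaries keyed by hit-list position (hypothetical stand-ins for the fan-out vector tables), and one call to `any()` plays the role of the hardware OR of all hit bits, which in hardware is a gate tree rather than a loop.

```python
from typing import Dict, List


def resolve_single(hit_bits: List[int], fanout: Dict[int, List[int]],
                   input_values: List[int], new_values: Dict[int, int]) -> int:
    """Single-counter multiple response resolver (software sketch).

    The counter starts at the top of the hit list and decrements. On a set
    bit it uses its position as a vector into the fan-out table, writes the
    gate's new value into each fan-out input, then clears the bit. Scanning
    stops as soon as the OR of all hit bits is 0. Returns the number of
    locations inspected.
    """
    inspected = 0
    for pos in range(len(hit_bits) - 1, -1, -1):
        if not any(hit_bits):          # OR of all hit bits: nothing left
            break
        inspected += 1
        if hit_bits[pos]:
            for target in fanout.get(pos, []):
                input_values[target] = new_values[pos]
            hit_bits[pos] = 0          # clear the hit before moving on
    return inspected


hits = [0, 1, 0, 0, 1, 0, 0, 0]
inputs = [9, 9, 9, 9]                  # input-value register bank
n = resolve_single(hits, {1: [0, 2], 4: [3]}, inputs, {1: 0, 4: 1})
print(n, inputs)                       # 7 [0, 9, 0, 1]
```

Location 0 is never inspected, because the OR of the hit bits is already 0 once the hit at location 1 has been cleared.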

Several scan registers can be used in the multiple response resolver to scan the column hit list in parallel, each scanning an equal-size portion. Each operates autonomously except when two or more registers simultaneously detect a hit; a clash has then occurred, and each such scan register must wait until it is arbitrarily allowed to access and update its fan-out list. The frequency of clashes depends on the probability of a hit for each scan register; typically this probability is between 0.01 and 0.001 for digital circuits. The timing mechanism in APPLES enables only active gates to be identified, and the multiple scan register structure provides a pipeline of gates to be updated for the current time interval without an explicit scheduling mechanism. The scheduler has been replaced by this more efficient parallel scan procedure.
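A cycle-level sketch of several scan registers sharing one single-ported fan-out memory is given below. It rests on assumptions that go beyond the text: the hit-list length divides evenly among the registers, inspecting a location and performing one fan-out update each take one cycle, a register stops scanning while it waits, and the arbitrary semaphore is modelled as first-come, first-served with ties broken by register number. A clash is counted whenever two or more registers detect a hit in the same cycle. The function name is hypothetical.

```python
from collections import deque
from typing import List, Tuple


def simulate_scan(hit_bits: List[int], n_regs: int) -> Tuple[int, int]:
    """Return (cycles, clashes) for n_regs registers scanning equal partitions."""
    if len(hit_bits) % n_regs:
        raise ValueError("hit list length must be a multiple of n_regs")
    size = len(hit_bits) // n_regs
    pos = [r * size for r in range(n_regs)]       # next location per register
    end = [(r + 1) * size for r in range(n_regs)]
    waiting = deque()                              # registers holding a hit
    cycles = clashes = 0
    while waiting or any(p < e for p, e in zip(pos, end)):
        cycles += 1
        # The semaphore grants the single-ported memory to one waiting register.
        served = waiting.popleft() if waiting else None
        new_hits = []
        for r in range(n_regs):
            if r == served or r in waiting or pos[r] >= end[r]:
                continue                           # updating, blocked or done
            if hit_bits[pos[r]]:
                new_hits.append(r)
            pos[r] += 1
        if len(new_hits) >= 2:
            clashes += 1                           # simultaneous hits: a clash
        waiting.extend(new_hits)
    return cycles, clashes


hits = [1, 0, 0, 0, 1, 0, 0, 0]
print(simulate_scan(hits, 1))   # (10, 0): 8 inspections + 2 updates
print(simulate_scan(hits, 2))   # (6, 1): both registers hit in cycle 1
```

Running it on longer random hit lists with different `n_regs` gives a rough feel for the trade-off analysed later in this description: more registers shorten the scan but raise the chance of simultaneous hits.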

When all gate types have been evaluated for the current time interval, all signals are updated by shifting the words of the input-value register bank in parallel into the corresponding words of the word-line register bank. For 8-valued logic (i.e. 3 bits for each word in the input-value register bank) this phase requires 3 machine cycles. The input-value register bank can be implemented as a multi-ported memory system which allows several input values to be updated simultaneously, provided that the values are located in different memory banks. Other numbers of logic values can be used.

The APPLES bit-shift mechanism has made the role of a scheduler redundant. Furthermore, it enables the gate evaluation process to be executed in memory, thereby avoiding the traditional Von Neumann bottleneck. Each word pair in array 1b is effectively a processor. Major issues which cause a large overhead in other parallel logic simulators are deadlock and scheduling.

Deadlock occurs in the Chandy-Misra algorithm due to two rules required for temporal correctness: an input waiting rule and an output waiting rule. The first rule is observed by the update mechanism of APPLES. For any time interval $T_i$ to $T_{i+1}$, all words in array 1b reflect the state of the wires at time $T_i$, and at the end of the evaluation and update process all wires have been updated to time $T_{i+1}$. All wires have been incremented by the smallest timestamp, one discrete time unit. Thus, at the start of every time interval all gates can be evaluated with confidence that the input values are correct. The output rule is imposed to ensure that signal values arrive for processing in non-decreasing timestamp order. This is guaranteed in APPLES, since all signal values maintain their temporal order in each word through the shift operation. Unlike in the Chandy-Misra algorithm, deadlock is impossible, as every gate can be evaluated at each time interval.

There is no scheduler in the APPLES system. Complex modelling such as inertial delays has confronted schedulers with costly (in time) unscheduling problems. Gates which have been scheduled to become active need to be descheduled when input signals are found to be shorter than some predefined minimum duration. This, together with the normal scheduling tasks, contributes to an onerous overhead.
Fig. 2 displays the equivalent mechanism in APPLES. An AND gate has two inputs, a and b. Assume that unless signals are at least three time units in duration no effect occurs at the output, that the simulation involves only the binary values 0 and 1, and that each bit in array 1b represents one time unit. Signal b is constant at value 1, while signal a is at logic 1 for two time units, less than the minimum time. This will be detected by the parallel search generated by the input and mask register combination, and the gate will not become active.
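The Fig. 2 check reduces to one masked search per input. The sketch below assumes binary logic, one bit per time unit, 8-bit words with the most recent value in the leftmost bit, and a minimum width of three units; it tests only the inertial (minimum-width) condition for a rising AND output, and the function names are hypothetical.

```python
def held_high(word: int, width: int, duration: int) -> bool:
    """True if the `duration` most recent (leftmost) bits of the word are all 1."""
    mask = ((1 << duration) - 1) << (width - duration)
    return (word & mask) == mask


def and_output_can_rise(a: int, b: int, width: int = 8, min_width: int = 3) -> bool:
    """Inertial-delay check: both inputs must have been 1 for min_width units."""
    return held_high(a, width, min_width) and held_high(b, width, min_width)


b = 0b11111111            # signal b is constantly 1
a_short = 0b01100000      # signal a was 1 for only two time units, then fell
a_long = 0b11100000       # had a stayed high one unit longer
print(and_output_can_rise(a_short, b))   # False: pulse too short, gate inactive
print(and_output_can_rise(a_long, b))    # True
```

Because the minimum width is encoded in the mask, a different inertial delay only changes `min_width`; no event has to be scheduled or withdrawn.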

The circuit is now ready to be simulated by APPLES, and is parsed to generate the gate type, delay model and topology information required to initialise associative arrays 1a and 1b and the fan-out vector tables. There is no limit on the number of fan-out gates.

The APPLES processor assumes that the circuit to be simulated has been translated into an equivalent circuit composed solely of 2-input logic gates. Thus, every gate has two wires leading into it (an inverter has two wires from one source). These wires are organised as adjacent words in associative array 1b, called a word set. Associative array 1a contains an identifier for every wire, indicating the type of gate and the input into which the wire is connected. The identifiers are held in an associative memory so that, when a particular gate evaluation test is executed, putting the relevant bit patterns into Input-reg1a and mask-reg1a specifies the gate type. All wires connected to such gates will be identified by a parallel search on associative array 1a, and these will be used to activate the appropriate words in associative array 1b (word-line register bank). Thus, gate evaluation tests will only be active on the relevant word sets.

The input-value register bank 17 contains the current input value for each wire. The three leftmost bits of every word in associative array 1b are shifted in from this bank in parallel when all signal values are being updated by one time unit. During the update phase of the simulation, the fan-out wires of active gates are identified and the corresponding words in the input-value register bank amended.

Simulation progresses in discrete time units. For any time interval, each gate type is evaluated by applying tests on associative array 1b and combining and recording the results in the neighbouring register banks. Regardless of the number of gates to be evaluated, this process occupies between 10 machine cycles for the simplest and 20 machine cycles for the more complex gate delay models (see Fig. 3). Once the fan-out gate inputs have been amended, all wires are time-incremented through a parallel shift operation of 3 machine cycles' duration. In general, for $2^N$-valued logic, $N$ shift operations are required to update all signal values.

Fig. 3 illustrates a simulation cycle. In the simulation cycle, the task most affected by circuit size is that of scanning the hit list: as a circuit grows, the list and the sequential scan time expand proportionately. Analogous to the conventional communication overhead problem, the APPLES architecture therefore incorporates a scan mechanism which can effectively increase the scan rate as the hit list expands, namely a multiple scan register structure. One of the features of the present invention is the parallelisation of the application of test vectors in the gate evaluation phase, as will be described hereinafter. Fig. 4 shows a search test pattern for an AND gate.

The series of signal values that appear on a wire over a period of discrete time units can be represented as a sequence of numbers. For example, in a binary system, if a wire has the logic values 1, 1, 0 applied to it at times $t_0$, $t_1$ and $t_2$ respectively, where $t_0 < t_1 < t_2$, the history of signal values on this wire can be denoted as the bit sequence 011; the further left the bit position, the more recently the value appeared on the wire.

Different delay models involve signal values over various time intervals. In any 
model, signal values stored in a word which are irrelevant are masked out of the 
search pattern. 

The signal values of a particular wire are updated by shifting all values right by one time unit and placing the current value in the leftmost position. Associative array 1b can shift all its words right in unison, and the new current values are shifted into associative array 1b from the input-value register bank.
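The history encoding and the shift update can be checked with a short sketch. Assumptions: each word of array 1b is an integer of `width` bits, a value of $2^N$-valued logic occupies N bits, and the input value enters from the left one bit per machine cycle, least significant bit first, so that after N single-bit shifts the whole value sits in the leftmost N bits. The helper names are hypothetical.

```python
from typing import List


def shift_in(word: int, value: int, n_bits: int, width: int) -> int:
    """Shift one N-bit value into the left of a word, one bit per cycle."""
    for k in range(n_bits):                       # one iteration = one cycle
        bit = (value >> k) & 1                    # LSB enters first
        word = (word >> 1) | (bit << (width - 1))
    return word


def update_all(words: List[int], inputs: List[int], n_bits: int, width: int) -> List[int]:
    """Array-wide update: every word shifts in its input value in unison."""
    return [shift_in(w, v, n_bits, width) for w, v in zip(words, inputs)]


# Binary example from the text: values 1, 1, 0 at t0 < t1 < t2.
w = 0
for v in (1, 1, 0):
    w = shift_in(w, v, n_bits=1, width=3)
print(format(w, "03b"))                           # 011

# 8-valued logic (N = 3): logic 1 = '111', then logic 0 = '000'.
w8 = shift_in(shift_in(0, 0b111, 3, 9), 0b000, 3, 9)
print(format(w8, "09b"))                          # 000111000
```

The N loop iterations correspond to the N shift operations (3 machine cycles for 8-valued logic) quoted above; the rightmost N bits, holding the oldest value, fall off the end of the word.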
Fig. 4 illustrates the parallel search patterns for an AND gate transition to logic '0'.

With wire signal values represented as bit sequences in associative memory words, the task of gate evaluation can be executed as a sequence of parallel pattern searches. Fig. 4 depicts the situation where 8-valued logic has been employed and the AND gate has been arbitrarily modelled as having a 1-unit delay.

Any gate which has any input satisfying $T_1$ and no input satisfying $T_2$ will transition to 0.

Consequently, to determine whether the output of this gate is going to transition from logic 1 to logic 0, it is necessary to know the signal values at the current time $t_c$ and at the previous time $t_{c-1}$. The current values are contained in the leftmost three bits of the word set. Fig. 4 declares the current values on the two inputs as logic 1 = '111' and logic 0 = '000', and the previous values as both logic 1.

To ascertain whether this AND gate has an output transition to logic '0', two simple bit-pattern tests suffice. If ANY current input value is logic 0 (test $T_1$) and NONE of the previous input values is logic 0 (test $T_2$), then the output will change to logic 0. These are the only conditions, for this delay model, which will effect this transition. With associative memory any portion of a word can be active or passive in a search. Thus, putting '000' and '111' into the leftmost three bits of the search and mask registers, respectively, of associative array 1b executes test $T_1$. Test $T_2$ can be executed by essentially the same test on the next three bit positions to the right.

In general, tests are applied one at a time. The result of test $T_i$ on word$_j$ is stored in the $i$th bit position of word$_j$ in the test-result register bank 4; a '1' indicates a successful test outcome. For each word set and every test it is necessary to know whether ANY, BOTH or NONE of the inputs passed the particular test. If the $i$th bits of word$_j$ and word$_{j+1}$ in the test-result register bank are ORed together and the result of this operation is '1', then at least one input in the corresponding word set passed test $T_i$ (the ANY condition test). If the result of the operation is '0', then no input passed test $T_i$ (the NONE condition test). Finally, if the $i$th bits are ANDed together and the result is '1', then BOTH inputs have passed test $T_i$.

The result-activator register 14 combines the test results, which are subsequently recorded in the group-result register. The logical interaction is shown in Fig. 5.

The AND or OR operation between the bit positions is dictated by the result-activator register. A '0' in the $i$th bit position of the result-activator register performs an OR on the results of test $T_i$ for each word set in the test-result register bank and, conversely, a '1' performs an AND. Each $i$th AND or OR operation is enacted in parallel across all word-set test-result register pairs.

The results of applying the result-activator register to each word-set test-result register pair are saved in an associated group-result register. Apart from retaining the results for a particular word set, the group-result registers are composite elements in an associative array. This facilitates a parallel search for a particular result pattern and thus identifies all active gates. These gates are identified as hits (of the search in the group-result register bank) in the group-test hit list.

Returning to the example of an AND gate transition to logic '0', an AND gate will be identified as fulfilling the test requisites (any input passing test $T_1$ and no input passing test $T_2$) if its corresponding group-result register has the bit sequence '10' in the first two bit positions.

The APPLES components involved in the gate evaluation phase, and their sequencing, are shown in Fig. 6.

In the present invention, one of the major features of the method is the storing of each line signal to a target logic gate as a plurality of bits, each representing a delay of one time period. The aggregate bits allow the signal output to, and reception by, the target logic gate to be accurately expressed, so that line delays are represented in the same manner as the inherent delay of each logic gate. It must be appreciated that, as the speed of circuits increases, the time taken to transmit a message between two logic gates can be considerable. Thus, the lines, as well as the logic gates, have to be considered as logic entities.

Some logic entities may have delays which exceed the time frame representable in a word of associative array 1b. Larger delays can be modelled by associating a state with a gate type. In this case a gate and its state are defined in associative array 1a. Tests are performed on associative array 1b and, when a gate with a given state passes some input value criterion, the gate state is amended in associative array 1a, in addition to the fan-out components of the gate possibly being affected. This new state may also cause a new output value to be ascribed to the fan-out list of the gate. The tests that are applied are determined by the gate type and state. In this mechanism the fan-out list of a gate includes the normal fan-out inputs and the address in associative array 1a of the gate itself.

In order to determine whether the state alone, or the state and the fan-out gates, are to be updated, the state (a binary value) can serve as a selector of the gate's fan-out update data files. The state amends the access point relative to the start location of a gate's data files, and this enables the gate's normal fan-out list to be bypassed or not.

On commencement of filling a new time frame (a word in associative array 1b), a special symbol is inserted into the leftmost (most recent time) position. This symbol conveys the input value on the gate and serves as a marker. When the marker reaches the rightmost position in the word, this indicates that a complete time frame has passed. This can be detected by the normal parallel test-pattern search technique on associative array 1b (see Fig. 1).

The interconnect between logic entities being simulated can be modelled using the large delay model described above. Furthermore, single wires can be modelled by one word instead of two in associative array 1a, associative array 1b and the test-result register bank. Branch points are modelled as separate wires, permitting different branch points to have different delay characteristics.

In effect, each delay is stored as a delay word in an associative memory forming part of the associative memory mechanism. The length of the delay word is ascertained and, if the delay word width exceeds the associative register word width, the delay cannot simply be stored in the register. Instead, the number of integer multiples of the register word width contained within the delay word is calculated as a gate state. This gate state is stored in a further state register, in effect the associative register, or associative array 1a. The remainder from the calculation is stored in associative register array 1b, as are the delay words whose width did not exceed the associative register width. Then, when the count of associative register 1b commences, the state register (associative register 1a) is consulted and the delay word is entered into the register. The remainder is ignored for this count of associative register array 1b. At the end of the count of associative register 1b, associative register 1a is updated by decrementing it by one unit. If this still does not allow the count to take place, the process is repeated. If, however, associative register 1a is cleared, then the count continues and the remainder now represents the count required.
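The state/remainder decomposition of a long delay is sketched below. It assumes that delays and the word width are both counted in time units, that the gate state is the integer quotient and the remainder the leftover units, and it simply lists the successive counts performed; it does not model the associative hardware, and the function names are hypothetical.

```python
from typing import List, Tuple


def split_delay(delay: int, width: int) -> Tuple[int, int]:
    """Return (gate state, remainder) for a delay wider than one word."""
    return divmod(delay, width)


def count_sequence(delay: int, width: int) -> List[int]:
    """Successive counts: whole words while the state is non-zero, then the remainder."""
    state, remainder = split_delay(delay, width)
    counts = []
    while state > 0:              # state register (array 1a) not yet cleared
        counts.append(width)      # remainder ignored; count a whole word
        state -= 1                # decrement the state at the end of the count
    if remainder:
        counts.append(remainder)  # state cleared: remainder is the final count
    return counts


print(split_delay(19, 8))         # (2, 3)
print(count_sequence(19, 8))      # [8, 8, 3]
```

The counts always sum to the original delay, so no time units are lost by splitting a long delay across the state register and array 1b.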


Complex delay models such as inertial delays require conventional sequential and parallel logic simulators to unschedule events when some timing criterion is violated. This entails an extremely time-consuming search through an event list. In the present invention, inertial delays only require verification that signals are at least some minimum time width, which is implementable as a single pattern search.

An ambiguous delay is more complicated, since the statistical behaviour of a gate conveys an uncertainty in the output: the gate output acquires an unknown value between two bounds of M time units and N time units. Using 4-valued logic, APPLES detects an initial output change to the unknown value after M time units, followed by the transition from the unknown value to logic state '0' after N time units (see Fig. 7). Hazard conditions, where both inputs simultaneously switch to opposite values, can also be detected, as illustrated in Fig. 7.

For each gate type, the evaluation time $T_{gate\text{-}eval}$ remains constant, typically ranging from 10 to 20 machine cycles. The time to scan the hit list depends on its length and the number of registers employed in the scan. N scan registers can divide a hit list of H locations into N equal partitions of size H/N. Assuming a location can be scanned in 1 machine cycle, the scan time $T_{scan}$ is H/N cycles. Likewise, it will be assumed that 1 cycle is sufficient to make 1 fan-out update.

For one scan register partition, the number of updates is $\mathrm{Prob}_{hit} \times H/N$. If all N partitions update without interference from other partitions, this also represents the total update time for the entire system. However, while one fan-out is being updated, other registers continue to scan, and hits in these partitions may have to wait and queue. The probability of this happening increases with the number of partitions and is given by ${}^{N}C_{1}\,\mathrm{Prob}_{hit}\,H/N$.

A clash occurs when two or more registers simultaneously detect a hit and attempt to access the single-ported fan-out memory. In these circumstances, a semaphore arbitrarily authorises the waiting registers' accesses to memory. The number of clashes during a scan is

$$\text{No. clashes} = (\text{Prob of 2 hits per inspection}) \times H/N + \text{higher-order probabilities} \qquad (1)$$

The low activity rate of circuits (typically 1%-5% of the total gate count) implies that higher-order probabilities can be ignored. Assume a uniform random distribution of hits and let $\mathrm{Prob}_{hit}$ be the probability that a register will encounter a hit on an inspection. Then (1) becomes



$$\text{No. clashes} = {}^{N}C_{2}\,(\mathrm{Prob}_{hit})^{2} \times H/N \qquad (2)$$

Thus $T_N$, the average total time required to scan and update the fan-out lists of a partition for a particular gate type, is

$$T_N = T_{gate\text{-}eval} + T_{scan} + T_{update} + T_{clash} = T_{gate\text{-}eval} + H/N + {}^{N}C_{1}\,\mathrm{Prob}_{hit}\,H/N + {}^{N}C_{2}\,(\mathrm{Prob}_{hit})^{2} \times H/N \qquad (3)$$



Since all partitions are scanned in parallel, $T_N$ also corresponds to the processing time for an N scan register system. Thus, the speedup $S_N = T_1/T_N$ of such a system is

$$S_N = \frac{T_1}{T_N} = \frac{T_{gate\text{-}eval} + H + \mathrm{Prob}_{hit}\,H}{T_{gate\text{-}eval} + H/N + {}^{N}C_{1}\,\mathrm{Prob}_{hit}\,H/N + {}^{N}C_{2}\,(\mathrm{Prob}_{hit})^{2} \times H/N} \qquad (4)$$

Eq. (4) has been validated empirically. Predicted results are within 20% of those observed for the sample circuits C7552 and C2670, and within 30% for C1908. Non-uniformity of the hit distribution appears to be the cause of this deviation.

Differentiating $T_N$ with respect to N and ignoring second-order and higher powers of $\mathrm{Prob}_{hit}$, the optimum number of scan registers $N_{optimum}$ and the corresponding optimum speedup $S_{optimum}$ are given by

$$N_{optimum} = \frac{\sqrt{2}}{\mathrm{Prob}_{hit}} \qquad (5)$$

$$S_{optimum} \approx \frac{1}{2.4 \times \mathrm{Prob}_{hit}} \qquad (6)$$



Thus, the optimum number of scan registers is inversely proportional to the probability of a hit being encountered in the hit list. In APPLES, the important processing metric is the rate at which gates can be evaluated and their fan-out lists updated. As the probability of a hit increases, there will be a corresponding increase in the rate at which gates are updated. Circuits under simulation which happen to exhibit higher hit rates will have a higher update rate.

When the average fan-out time is not one cycle, $\mathrm{Prob}_{hit}$ is multiplied by $F_{out}$, where $F_{out}$ is the effective average fan-out time.
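Eqs. (3) to (6) can be checked numerically. The sketch below evaluates $T_N$ and $S_N$ exactly as written, brute-forces the integer N that minimises $T_N$, and compares it with Eqs. (5) and (6). The hit probability of 0.01 is the value reported for the benchmarks and $T_{gate\text{-}eval} = 15$ lies in the quoted 10 to 20 cycle range, but the hit-list length H = 1,000,000 is an arbitrary illustrative choice, and the fan-out time is taken as one cycle ($F_{out} = 1$). Function names are hypothetical; `math.comb` needs Python 3.8 or later.

```python
from math import comb, sqrt


def t_n(n: int, h: float, p: float, t_gate: float) -> float:
    """Eq. (3): gate evaluation + scan + update + clash time for N registers."""
    return t_gate + h / n + comb(n, 1) * p * h / n + comb(n, 2) * p**2 * h / n


def speedup(n: int, h: float, p: float, t_gate: float) -> float:
    """Eq. (4): S_N = T_1 / T_N."""
    return t_n(1, h, p, t_gate) / t_n(n, h, p, t_gate)


def best_n(h: float, p: float, t_gate: float, n_max: int = 1000) -> int:
    """Integer N minimising T_N, found by exhaustive search."""
    return min(range(1, n_max + 1), key=lambda n: t_n(n, h, p, t_gate))


H, P, T_GATE = 1_000_000, 0.01, 15
n_opt = best_n(H, P, T_GATE)
print(n_opt, round(sqrt(2) / P, 1))                                     # 141 141.4
print(round(speedup(n_opt, H, P, T_GATE), 1), round(1 / (2.4 * P), 1))  # 41.9 41.7
```

For these parameters the brute-force optimum agrees with Eq. (5) to within one register, and the resulting speedup lies within about 1% of the approximation in Eq. (6).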






A higher hit rate can also be accomplished through the introduction of extra registers. An increase in registers increases both the hit rate and the number of clashes. The increase halts when the hit rate equals the fan-out update rate, which occurs at $N_{optimum}$. This situation is analogous to a saturated pipeline. Further increases in the number of registers serve only to increase the number of clashes and the waiting lists of those registers attempting to update fan-out lists.

Further simulations were carried out in which, again, a Verilog model of APPLES simulated four ISCAS-85 benchmarks, C7552 (4392 gates), C2670 (1736 gates), C1908 (1286 gates) and C880 (622 gates), using a unit delay model. Each was exercised with 10 random input vectors over a time period ranging from 1,000 to 10,000 machine cycles. Statistics were gathered as the number of scan registers varied from 1 to 50. The speedup relative to the number of scan registers is shown in Table 1.



(a) Speedup

Circuit     No. Scan Registers
            15      30      50
C7552       12.5    19.9    24.3
C2670       9.7     13.8    15.9
C1908       8.4     10.8    11.8
C880        7.8     8.3     9.7

(b) Speedup (excluding fixed-size overheads)

Circuit     No. Scan Registers
            15      30      50
C7552       13.6    24.3    29.6
C2670       12.5    20.0    25.1
C1908       11.8    17.3    20.9
C880        11.1    12.6    15.9

Table 1. Speedup Performance of Benchmarks



Table 1(a) demonstrates that in general the speedup increases with the number of scan registers. The fixed-size overheads of gate evaluation, shifting inputs, etc. tend to penalise the performance of the smaller circuits with a large number of registers. A more balanced analysis is obtained by factoring out all fixed time overheads in the simulation results. This reflects the performance of realistic, large circuits, where the fixed overheads will be negligible compared with the scan time. Table 1(b) details the results with this correction. As expected, this correction has less effect on the larger benchmark circuits.
Av. No. Cycles / Gate Processed

Circuit     No. Scan Registers
            1        15      30      50
C7552       154.6    11.3    6.4     5.2
C2670       101.9    8.0     5.1     3.9
C1908       86.9     6.8     5.1     3.9
C880        49.9     4.9     4.2     3.6

Table 2. Average No. of machine cycles per gate processed

Taking the corrected simulated performance statistics, Table 2 displays the average number of machine cycles expended to process a gate. The APPLES system intrinsically detects only active gates; no futile updates or processing are executed. The data takes into account the scan time between hits and the time to update the fan-out lists. As more registers are introduced, the time between hits reduces and the gate update rate increases. Clashes happen, and active gates are effectively queued in a fan-out/update pipeline. The speedup saturates when the fan-out/update rate, governed by the size of the average fan-out list, equals the rate at which gates enter the pipeline.

The benchmark performance of the circuits also permits an assessment of the validity of the theory for the speedup. From the speedup measurements in Table 1(b), the corresponding value of $f_{av}$ was calculated using Eq. (7). This value, representing the average fan-out update time in machine cycles, should be constant regardless of the number of scan registers. Furthermore, for the evaluated benchmarks the fan-out ranged from 0 to 3 gates and the probability of a hit, $\mathrm{Prob}_{hit}$, was found to be 0.01 ± 5%. Within one and a half clock cycles it is possible to update 2 fan-out gates; therefore, depending on the circuit, $f_{av}$ should be in the range 0.5 to 1.5. The calculated values of $f_{av}$ are shown in Table 3.
f_av (machine cycles)

Circuit     No. Scan Registers
            15      30      50      Av.
C7552       0.41    0.35    0.88    0.55
C2670       0.52    0.79    1.26    0.86
C1908       0.77    1.21    1.32    1.10
C880        0.16    1.98    1.54    1.22

Table 3. The Average Fan-out Update Time (in machine cycles) for the Benchmarks

The values of f_av are in accord with the range expected for the fan-out of these
circuits. The fluctuations in f_av across a row, where it should be constant,
are possibly due to the relatively small number of samples and the size of the circuits,
where a small perturbation in the distribution of hits in the hit list can significantly
affect the speedup figures. In the case of C880, a 10% drop in speedup can
effectively lead to a ten-fold increase in f_av.
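The Table 3 figures can be checked against the expected range with a few lines of Python. This sketch assumes the "Av" column is a simple mean of the three per-register values (it agrees to within 0.01 for every circuit); the function name is illustrative. All four averages fall inside 0.5 to 1.5, while five of the twelve individual entries do not, which is consistent with the sampling explanation above.

```python
# f_av values from Table 3: 15, 30 and 50 scan registers, then the reported "Av".
F_AV = {
    "C7552": (0.41, 0.35, 0.88, 0.55),
    "C2670": (0.52, 0.79, 1.26, 0.86),
    "C1908": (0.77, 1.21, 1.32, 1.10),
    "C880": (0.16, 1.98, 1.54, 1.22),
}
EXPECTED_RANGE = (0.5, 1.5)  # range argued for in the text


def check_row(values, tol=0.01):
    """Return (mean of samples matches reported Av within tol, samples outside the range)."""
    *samples, reported_av = values
    mean = sum(samples) / len(samples)
    lo, hi = EXPECTED_RANGE
    outliers = [v for v in samples if not lo <= v <= hi]
    return abs(mean - reported_av) <= tol, outliers


if __name__ == "__main__":
    for circuit, row in F_AV.items():
        print(circuit, check_row(row))
```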



For comparison purposes, Table 4 uses data from Banerjee, Parallel Algorithms for
VLSI Computer-Aided Design, Prentice-Hall, 1994, which illustrates the speedup
performance on various parallel architectures for circuits of similar size to those
used in this paper. This indicates that APPLES consistently offers higher speedup.

Architecture               Synchronous                  Asynchronous
                           Shared       Distributed     Shared             Distributed
                           Memory       Memory          Memory             Memory
Circuit
Multiplier (4990 gates)    5.0/8        /               5.0/8, 5.8/14      /
H-FRISC (5060 gates)       3.7/8        /               7.0/8, 8.2/14      /
S15850 (9772 gates)        /            3.2/8           /                  /
S13207 (7951 gates)        /            3.2/8           /                  /
Adder (400 gates)          /            /               4.5/16, 6.5/32     /
QRS (1000 gates)           /            /               5.0/16, 7.0/32     /

Speedup Performance for Various Parallel Systems

Notation a/b, where a = Speedup value, b = No. Processors.

Double entries denote two different systems of the same architecture.

TABLE 4 - A speedup comparison of other parallel architectures

The following from pages 28 to 54 is one example of an implementation of the 
present invention in software written in Verilog. 
Verilog Description of APPLES
Associative Array 1a

Description: Each word of this array holds a bit sequence identifying the gate type input
connection of a wire in the corresponding position in Associative Array 1b. The input/mask
register combination defines a gate type that will be activated for searching in Associative
Array 1a. Words that successfully match are indicated in a 1-bit column register. The array also
has write capabilities.

module Ary_la (Input_regla, Mask_regla, Adr_regla, Clock,
               Search_enblla, Write_enblla, Activ_lstla);

/* Input_regla, Mask_regla and Adr_regla are the Input, Mask and Address registers.
   When Search_enblla is set, the negative edge of Clock initiates a parallel
   search. Activ_lstla is a 1-bit column register that indicates those words in
   Associative Array1a which compared successfully with the search pattern. */

parameter Ary_la_wdth=7;
parameter Aryla_size=16383;
integer Ary_index;

input Clock, Search_enblla, Write_enblla;

input [Ary_la_wdth:0] Input_regla, Mask_regla, Adr_regla;

output [Aryla_size:0] Activ_lstla;
reg [Aryla_size:0] Activ_lstla;

reg [Ary_la_wdth:0] Aryla_ass_mem [0:Aryla_size];
reg [Ary_la_wdth:0] Temp_reg;

initial
begin
  $readmemb("Aryla.dat", Aryla_ass_mem);
  // Aryla.dat is the data file defining the gate and model types in the circuit.

  for (Ary_index=0; Ary_index<=Aryla_size; Ary_index=Ary_index+1)
    Activ_lstla[Ary_index]=0;
end

always @(negedge Clock)
begin
  if (Search_enblla)
  begin
    for (Ary_index=0; Ary_index<=Aryla_size; Ary_index=Ary_index+1)
    begin
      Temp_reg=Aryla_ass_mem[Ary_index];
      // A word matches when every bit selected by Mask_regla equals Input_regla.
      if ((~Mask_regla | (Input_regla & Temp_reg) |
           (~Input_regla & ~Temp_reg)) == 8'hff)
        Activ_lstla[Ary_index]=1;
      else
        Activ_lstla[Ary_index]=0;
    end
  end

  if (Write_enblla) Aryla_ass_mem[Adr_regla]=Input_regla;
end

endmodule
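The masked comparison used by Ary_la (and, at other widths, by Ary_lb and Grp_rslt_reg_bank) can be checked in software. The Python sketch below is a bit-level model written for this explanation, not part of the APPLES listing; `masked_match` and `search_array` are hypothetical names. It shows that the Verilog expression reduces to "every masked bit agrees", i.e. `((search ^ word) & mask) == 0`.

```python
def masked_match(search: int, mask: int, word: int, width: int = 8) -> bool:
    """Model of the Ary_la test: ~mask | (search & word) | (~search & ~word) == all ones."""
    full = (1 << width) - 1

    def inv(x: int) -> int:
        # Bitwise NOT limited to `width` bits, like a Verilog vector of that width.
        return ~x & full

    agree = inv(mask) | (search & word) | (inv(search) & inv(word))
    return agree == full


def search_array(words, search: int, mask: int, width: int = 8):
    """Return the 1-bit activity column: one entry per word in the array."""
    return [1 if masked_match(search, mask, w, width) else 0 for w in words]


if __name__ == "__main__":
    words = [0b1010_0000, 0b1011_0000, 0b0010_0000]
    print(search_array(words, 0b1010_0000, 0b1111_0000))  # all four high bits compared
    print(search_array(words, 0b1010_0000, 0b0011_0000))  # only bits 5 and 4 compared
```

Masked-out bits are "don't care", which is what lets a single search select every gate of one type regardless of the other fields in the word.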

Associative Array 1b

Description: Every word in this array represents the temporal spread of signal values on a
specific wire, with the most recent values leftmost in each word. All words can be
shifted right simultaneously, effecting a one-unit time increment on all wires. The signal values
are updated from a 1-bit column register. The array has parallel search, read and write
capabilities.



module Ary_lb (Search_reglb, Mask_reglb, Adr_reglb, Datain_reglb,
               Dataout_reglb, Hit_buffr_reglb, Shft_enbl, Search_enbllb,
               Write_enbl, Read_enbl, Clock, Input_bit,
               Word_line_enbl);

/* Search_reglb, Mask_reglb, Adr_reglb, Datain_reglb and Dataout_reglb are the
   Search, Mask, Address, Data-in and Data-out registers of Associative Array1b.
   When Search_enbllb is set, the negative edge of Clock initiates a parallel
   search. Likewise, a read or write operation is executed on the negative edge of
   the clock if Read_enbl or Write_enbl is asserted.

   The search is only active on those words that are primed for searching by
   the Word_line_enbl column register. The bits in this register are set/cleared by
   Activ_lstla of Associative Array1a. This effectively selects gates of a certain
   gate type and delay model. Words that match are identified by a bit being set in
   the corresponding position in Hit_buffr_reglb.

   Words are shifted right in parallel, with the leftmost bit being taken from
   Input_bit. */

parameter Arylb_mem_size=16383;
parameter Wlr_wrdsize=31;
parameter Shft_dly=2;
parameter Adr_reg_bits=13;

input [Wlr_wrdsize:0] Search_reglb, Mask_reglb, Datain_reglb;
input [Arylb_mem_size:0] Input_bit, Word_line_enbl;
input Clock;
input Shft_enbl, Search_enbllb, Write_enbl, Read_enbl;
input [Adr_reg_bits:0] Adr_reglb;

reg [Wlr_wrdsize:0] Temp_reg1;
reg [Wlr_wrdsize:0] Wlr_Ass_mem [0:Arylb_mem_size];

output [Arylb_mem_size:0] Hit_buffr_reglb;
reg [Arylb_mem_size:0] Hit_buffr_reglb;

output [Wlr_wrdsize:0] Dataout_reglb;
reg [Wlr_wrdsize:0] Dataout_reglb;

integer Mem_indx;

initial $readmemb("Arraylb.dat", Wlr_Ass_mem);
// Arraylb.dat is the file which initialises all the words in Array1b.

always @(negedge Clock)
begin
  if (Shft_enbl)
  begin
    // One time increment: shift every word right, new value enters at the left.
    for (Mem_indx=0; Mem_indx<=Arylb_mem_size; Mem_indx=Mem_indx+1)
    begin
      Temp_reg1=Wlr_Ass_mem[Mem_indx];
      Temp_reg1=Temp_reg1 >> 1;
      Temp_reg1[Wlr_wrdsize]=Input_bit[Mem_indx];
      Wlr_Ass_mem[Mem_indx]=Temp_reg1;
    end
  end
  else if (Search_enbllb)
  begin
    for (Mem_indx=0; Mem_indx<=Arylb_mem_size; Mem_indx=Mem_indx+1)
    begin
      if (Word_line_enbl[Mem_indx])
      begin
        Temp_reg1=Wlr_Ass_mem[Mem_indx];
        if ((~Mask_reglb | (Search_reglb & Temp_reg1) |
             (~Search_reglb & ~Temp_reg1)) == 32'hffffffff)
          Hit_buffr_reglb[Mem_indx]=1;
        else
          Hit_buffr_reglb[Mem_indx]=0;
      end
      else
        Hit_buffr_reglb[Mem_indx]=0;  // word not primed for this search
    end
  end
  else if (Write_enbl)
    Wlr_Ass_mem[Adr_reglb]=Datain_reglb;
  else if (Read_enbl)
    Dataout_reglb=Wlr_Ass_mem[Adr_reglb];
end

endmodule
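A minimal Python model of the time-increment shift in Ary_lb, assuming the 32-bit word width given by Wlr_wrdsize. The function names and the rising-edge search pattern are illustrative assumptions, not taken from the listing; the pattern only shows how a search/mask pair can pick out a wire whose two most recent values are 0 followed by 1.

```python
WORD_BITS = 32  # Wlr_wrdsize = 31


def shift_all(words, input_bits, width=WORD_BITS):
    """Shft_enbl branch: shift every word right by one and place the new signal
    value in the leftmost (most recent) bit position."""
    msb = 1 << (width - 1)
    return [(w >> 1) | (msb if bit else 0) for w, bit in zip(words, input_bits)]


def rising_edge_hits(words, width=WORD_BITS):
    """Hypothetical search: current value 1 (leftmost bit), previous value 0."""
    mask = 0b11 << (width - 2)
    search = 0b10 << (width - 2)
    return [1 if ((w ^ search) & mask) == 0 else 0 for w in words]


if __name__ == "__main__":
    history = [0]
    for bit in (1, 1, 0):  # one wire taking the values 1, 1, 0 over three time steps
        history = shift_all(history, [bit], width=4)
        print(format(history[0], "04b"))
```

Reading each word from the left gives the wire's values from newest to oldest, so a transition is just a two-bit pattern in the leftmost positions.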
Test-result register Bank

Description: When the i-th search is executed on Associative Array1b, if word j in Array1b matches
the search pattern, then bit i in word j of the Test-result register bank is set; otherwise it is
cleared. The Result-activator register specifies the logical combination between pairs of words (a
gate's set of inputs). The result of this combination of word pairs is a column register half the
length of the number of words, i.e. one bit per word pair.

module Tst_rslt_reg_bank (Inp_buffr_reg, Trr_wrt_enbl, Comb_enbl, Clock,
                          Out_buffr_reg, Rslt_act_reg, Write_pos, Rset);

/* Inp_buffr_reg is a column of bits describing the outcome of a search on each
   word in Array1b. This bit column is written into a column of the Test-result
   register bank on the negative edge of Clock when Trr_wrt_enbl is asserted. The
   position of this column is defined by Write_pos.

   Word pairs are combined according to the bit sequence in Rslt_act_reg. A '0' in
   bit i of Rslt_act_reg ORs the i-th bits in each word pair (a '1' ANDs them) and
   produces the result for each pair in Out_buffr_reg. This combination is executed
   on the negative edge of Clock when Comb_enbl is asserted. Rset resets all the bits
   in the Test-result register bank. */

parameter Trr_word_size=7;
parameter Trr_mem_size=16383;
parameter Trr_out_size=8191;
parameter Trr_wdth_spec=2;

reg [Trr_word_size:0] Trr_array [0:Trr_mem_size];
reg [Trr_word_size:0] Temp_reg1, Temp_reg2;
reg Rslt_action;

input [Trr_mem_size:0] Inp_buffr_reg;
input [Trr_word_size:0] Rslt_act_reg;
input [Trr_wdth_spec:0] Write_pos;
input Clock;
input Trr_wrt_enbl;
input Comb_enbl;
input Rset;

output [Trr_out_size:0] Out_buffr_reg;
reg [Trr_out_size:0] Out_buffr_reg;

integer Bank_index;

always @(negedge Clock)
begin
  if (Trr_wrt_enbl)
  begin
    // Write the search-result column into bit position Write_pos of every word.
    for (Bank_index=0; Bank_index<=Trr_mem_size; Bank_index=Bank_index+1)
    begin
      Temp_reg1=Trr_array[Bank_index];
      Temp_reg1[Write_pos]=Inp_buffr_reg[Bank_index];
      Trr_array[Bank_index]=Temp_reg1;
    end
  end
  else if (Comb_enbl)
  begin
    Rslt_action=Rslt_act_reg[Write_pos];
    // Combine bit Write_pos of words 2k and 2k+1 into bit k of Out_buffr_reg.
    for (Bank_index=0; Bank_index<Trr_mem_size; Bank_index=Bank_index+2)
    begin
      Temp_reg1=Trr_array[Bank_index];
      Temp_reg2=Trr_array[Bank_index+1];
      if (Rslt_action==0)
        Out_buffr_reg[Bank_index/2]=Temp_reg1[Write_pos] | Temp_reg2[Write_pos];
      else
        Out_buffr_reg[Bank_index/2]=Temp_reg1[Write_pos] & Temp_reg2[Write_pos];
    end
  end
  else if (Rset)
  begin
    // Clear every word of the bank.
    for (Bank_index=0; Bank_index<=Trr_mem_size; Bank_index=Bank_index+1)
      Trr_array[Bank_index]=0;
  end
end

endmodule
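The two operations of Tst_rslt_reg_bank can be modelled with plain integers, one per word. This Python sketch is an illustration under that assumption (the function names are hypothetical); `action_bit` plays the role of Rslt_act_reg[Write_pos], with 0 selecting OR and 1 selecting AND, as in the Verilog above.

```python
def write_column(bank, column_bits, pos):
    """Trr_wrt_enbl branch: bit `pos` of word j is set to column_bits[j]."""
    out = []
    for word, bit in zip(bank, column_bits):
        word &= ~(1 << pos)          # clear the target bit
        out.append(word | (bit << pos))
    return out


def combine_pairs(bank, pos, action_bit):
    """Comb_enbl branch: OR (action_bit 0) or AND (action_bit 1) bit `pos` of
    words 2k and 2k+1; the result has one bit per word pair."""
    result = []
    for k in range(0, len(bank) - 1, 2):
        a = (bank[k] >> pos) & 1
        b = (bank[k + 1] >> pos) & 1
        result.append(a & b if action_bit else a | b)
    return result


if __name__ == "__main__":
    bank = write_column([0, 0, 0, 0], [1, 0, 0, 1], pos=0)
    print(combine_pairs(bank, pos=0, action_bit=0))  # OR of each input pair
    print(combine_pairs(bank, pos=0, action_bit=1))  # AND of each input pair
```

Treating each word pair as one two-input gate explains why the output column is half as long as the bank.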



WO 01/01298 




Group-result register Bank 

Description: The result of the combination of word pairs in the Test-result register is written as
a column of bits into the Group-result register bank. When all combination results have been
generated, a parallel search is executed on the Group-result register to ascertain all word pairs in
Array1b that passed all the test pattern searches.

module Grp_rslt_reg_bank (Grr_inp_reg, Grr_mask_reg, Grr_srch_reg,
                          Clock, Srch_enbl, Wrt_enbl, Write_pos,
                          Grr_hit_list);

/* Grr_inp_reg is shifted as a bit column into a column of the Group-result
   register bank defined by Write_pos. This column write operation is activated on
   the negative edge of Clock when Wrt_enbl is asserted.

   Grr_mask_reg and Grr_srch_reg compose a search pattern enacted on the negative
   edge of Clock when Srch_enbl is set. Pattern matches are indicated in
   Grr_hit_list. Grr_hit_list is also known as the Group-test Hit list. */

parameter Grr_mem_size=8191;
parameter Grr_word_size=7;
parameter Grr_wdth_spec=2;

input [Grr_mem_size:0] Grr_inp_reg;
input [Grr_word_size:0] Grr_mask_reg, Grr_srch_reg;
input [Grr_wdth_spec:0] Write_pos;
input Clock, Srch_enbl, Wrt_enbl;

output [Grr_mem_size:0] Grr_hit_list;
reg [Grr_mem_size:0] Grr_hit_list;

reg [Grr_word_size:0] Grr_array [0:Grr_mem_size];
reg [Grr_word_size:0] Temp_reg;

integer Bank_index;

always @(negedge Clock)
  if (Wrt_enbl)
  begin
    for (Bank_index=0; Bank_index<=Grr_mem_size; Bank_index=Bank_index+1)
    begin
      Temp_reg=Grr_array[Bank_index];
      Temp_reg[Write_pos]=Grr_inp_reg[Bank_index];
      Grr_array[Bank_index]=Temp_reg;
    end
  end
  else if (Srch_enbl)
    for (Bank_index=0; Bank_index<=Grr_mem_size; Bank_index=Bank_index+1)
    begin
      Temp_reg=Grr_array[Bank_index];
      if ((~Grr_mask_reg | (Grr_srch_reg & Temp_reg) |
           (~Grr_srch_reg & ~Temp_reg)) == 8'hff)
        Grr_hit_list[Bank_index]=1;
      else
        Grr_hit_list[Bank_index]=0;
    end

endmodule
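One plausible way to use the Group-result search is to look for word pairs whose bit is 1 in every test position. The listing does not state which search pattern is loaded, so the all-ones pattern in this Python sketch is an assumption, as is the name `group_hits`.

```python
def group_hits(columns, num_tests):
    """columns[t][k] is the pair result of test t for word pair k (one column per
    Write_pos). Returns the Group-test Hit list under an assumed all-ones search
    pattern, masked to the first num_tests bit positions."""
    num_pairs = len(columns[0])
    words = [0] * num_pairs
    for pos, column in enumerate(columns[:num_tests]):
        for k, bit in enumerate(column):
            words[k] |= bit << pos       # column write into bit `pos`
    mask = (1 << num_tests) - 1
    search = mask                        # every tested position must be 1
    return [1 if ((w ^ search) & mask) == 0 else 0 for w in words]


if __name__ == "__main__":
    # Two tests over three word pairs: only pair 0 passes both.
    print(group_hits([[1, 1, 0], [1, 0, 1]], num_tests=2))
```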

Multiple-response resolver (Version 1.0 Single Scan Mode)

Description: The Multiple-response resolver scans the Group-test Hit list (a 1-bit column
register). The resolver commences a scan by initialising its counter with the top address of the
Hit list. This counter serves as an address register which facilitates reading off every Hit list bit.
If the inspected bit is set, the fan-out list of the corresponding gate is updated appropriately and
the bit is reset. After reset, or if the bit was already zero, the counter is decremented to the next
address and the process is repeated. The scanning terminates either when all bits have been
inspected or when all bits are zero.

module Multiple_res_res (Grr_hit_list, Clock,
                         Reset_ctr, End_scan_flag, Decrmt_enbl,
                         Fan_out_src_reg, Fan_out_size_reg, Rset_hit_fnd_flg,
                         Hit_fnd_flag);

/* The Multiple-response resolver inspects one bit of Grr_hit_list on the negative
   edge of Clock while Decrmt_enbl is asserted. Reset_ctr loads the resolver's
   counter with the top Hit list address. If the currently inspected bit is set,
   Hit_fnd_flag is asserted and the vector and size of the corresponding fan-out
   list are loaded into Fan_out_src_reg and Fan_out_size_reg. Hit_fnd_flag is
   cleared on the positive edge of Rset_hit_fnd_flg. */

parameter Grr_mem_size=8191;
parameter Vectr_tbl_adr_reg_bits=13;
parameter Fanout_hdr_tbl_wdth=13;
parameter Max_fan_out=7;
parameter Inp_bnk_size=16383;

input Reset_ctr, Rset_hit_fnd_flg, Clock;
input [Grr_mem_size:0] Grr_hit_list;
input Decrmt_enbl;

output End_scan_flag;
reg End_scan_flag;

output Hit_fnd_flag;
reg Hit_fnd_flag;

output [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;
reg [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;

output [Max_fan_out:0] Fan_out_size_reg;
reg [Max_fan_out:0] Fan_out_size_reg;

reg [Fanout_hdr_tbl_wdth:0] Fan_out_hdr_tbl [0:Inp_bnk_size];
reg [Vectr_tbl_adr_reg_bits:0] Hit_lst_ctr;
reg [Max_fan_out:0] Fan_out_size_tbl [0:Inp_bnk_size];
reg [Grr_mem_size:0] Hit_lst_buffr;

reg Hit_fnd_ORed_flg, Tst_or_bit;

integer Num_hits, Hit_dist, Sum_hit_dist, Prev_hit_lst_ctr, Avg_dist;

initial $readmemh("Fanout.dat", Fan_out_hdr_tbl);
/* The file Fanout.dat contains the vectors for the start of the fan-out lists for
   every gate in the circuit being simulated. */

initial $readmemh("Fansize.dat", Fan_out_size_tbl);
/* The file Fansize.dat specifies the size of the fan-out list for each gate being
   simulated. */

initial forever
begin
  @(Reset_ctr)
  if (Reset_ctr)
  begin
    Num_hits=0;
    Prev_hit_lst_ctr=Grr_mem_size;
    Sum_hit_dist=0;
    Hit_lst_buffr=Grr_hit_list;
    Tst_or_bit=|Grr_hit_list;
    $display("OR Check=%b", Tst_or_bit);
    Hit_lst_ctr=Grr_mem_size;
    End_scan_flag=0;
    Hit_fnd_flag=0;
    Hit_fnd_ORed_flg=1;
    $display("Initialisation seq executed");
  end
end

always @(negedge Clock)
begin
  if ((Decrmt_enbl) && (!End_scan_flag))
  begin
    Hit_fnd_ORed_flg=|Hit_lst_buffr;
    if ((Hit_lst_ctr>0) && (Hit_fnd_ORed_flg))
    begin
      if (Hit_lst_buffr[Hit_lst_ctr]==1)
      begin
        Num_hits=Num_hits+1;
        Hit_dist=Prev_hit_lst_ctr-Hit_lst_ctr;
        Sum_hit_dist=Hit_dist+Sum_hit_dist;
        $display("Hit distance=%d Time=%d", Hit_dist, $time);
        Prev_hit_lst_ctr=Hit_lst_ctr;
        Fan_out_size_reg=Fan_out_size_tbl[Hit_lst_ctr];
        Fan_out_src_reg=Fan_out_hdr_tbl[Hit_lst_ctr];
        Hit_fnd_flag=1;
        Hit_lst_buffr[Hit_lst_ctr]=0;
      end
    end

    if ((Hit_lst_ctr>0) && (!Hit_fnd_ORed_flg))
    begin
      // No set bits remain: end the scan early.
      End_scan_flag=1;
      $display("No of hits in fan-out list=%d", Num_hits);
      Avg_dist=Sum_hit_dist/Num_hits;
      $display("Average hit distance=%d", Avg_dist);
    end

    if (Hit_lst_ctr==0)
    begin
      // Last address: inspect it, then end the scan.
      if (Hit_lst_buffr[Hit_lst_ctr]==1)
      begin
        Num_hits=Num_hits+1;
        Hit_dist=Prev_hit_lst_ctr-Hit_lst_ctr;
        $display("Hit distance=%d", Hit_dist);
        Prev_hit_lst_ctr=Hit_lst_ctr;
        Sum_hit_dist=Hit_dist+Sum_hit_dist;
        Fan_out_size_reg=Fan_out_size_tbl[Hit_lst_ctr];
        Fan_out_src_reg=Fan_out_hdr_tbl[Hit_lst_ctr];
        Hit_fnd_flag=1;
      end

      End_scan_flag=1;
      $display("No of hits in fan-out list=%d", Num_hits);
      Avg_dist=Sum_hit_dist/Num_hits;
      $display("Average hit distance=%d", Avg_dist);
    end

    Hit_lst_ctr=Hit_lst_ctr-1;
  end
end

always @(posedge Rset_hit_fnd_flg)
begin
  Hit_fnd_flag=0;
end

endmodule
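The single-scan resolver's behaviour can be mirrored in a few lines of Python. This sketch (hypothetical name `single_scan`) scans from the top address down, clears each serviced bit, stops as soon as no set bits remain, and reports the integer-average hit distance as the Verilog $display statements do. It models only the scan order and statistics, not clock timing or the fan-out update itself.

```python
def single_scan(hit_list):
    """Return (addresses serviced in order, integer average hit distance,
    number of positions inspected before the scan stops)."""
    bits = list(hit_list)
    top = len(bits) - 1
    prev = top                    # Prev_hit_lst_ctr starts at the top address
    serviced, distances = [], []
    inspected = 0
    ctr = top                     # Hit_lst_ctr
    while ctr >= 0 and any(bits):  # stop at the bottom or when no set bits remain
        inspected += 1
        if bits[ctr]:
            serviced.append(ctr)
            distances.append(prev - ctr)
            prev = ctr
            bits[ctr] = 0         # cleared once its fan-out list has been loaded
        ctr -= 1
    avg = sum(distances) // len(distances) if distances else 0
    return serviced, avg, inspected


if __name__ == "__main__":
    print(single_scan([0, 1, 0, 0, 1, 0, 0, 0]))
```

Because every bit between hits costs one inspection, the scan time is governed by the hit distance; that is the cost the Multiple Scan Mode version below attacks by splitting the list.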
Multiple-response resolver (Version 2.0 Multiple Scan Mode)

Description: The Multiple-response resolver scans the Group-test Hit list (a 1-bit column
register). In Multiple Scan Mode the resolver consists of several counter (scan) registers, each
assigned an equal-sized portion of the Group-test Hit list. When the resolver is initialised, all scan
registers point to the top of their respective Hit list segments. The registers are synchronised by a
single clock. The external functionality of the Multiple Scan Mode resolver is identical to that of
the Single Scan Mode version. Internally, the Multiple Scan version uses a wait semaphore to
queue multiple accesses to the fan-out lists. Registers which clash are queued arbitrarily and
only recommence scanning after gaining permission to update their fan-out lists. Scanning
terminates when all bits have been inspected or all bits are zero.

module Multiple_res_res (Grr_hit_list, Clk,
                         Reset_ctr, End_scan_flag, Decrmt_enbl,
                         Fan_out_src_reg, Fan_out_size_reg, Rset_hit_fnd_flg,
                         Hit_fnd_flag);

/* The Multiple-response resolver inspects several bits of Grr_hit_list in
   parallel on the negative edge of Clk while Decrmt_enbl is asserted.
   Reset_ctr loads the resolver's scan registers with the top location of each
   respective segment of the Hit list. If any of the currently inspected bits are
   set, Hit_fnd_flag is asserted. The vector and the size (no. of gates) of the
   fan-out list for the segment which has been granted permission are loaded into
   Fan_out_src_reg and Fan_out_size_reg, respectively. Scanning halts for all
   registers awaiting permission. Permission is arbitrarily granted to a segment on
   the positive edge of Rset_hit_fnd_flg, which is externally controlled. For
   registers that have not found a hit, a new bit is inspected on the negative edge
   of Clk. Scanning terminates when all bits have been inspected or reset to zero;
   this condition is indicated by End_scan_flag. */



parameter Grr_mem_size=8191;
parameter Vectr_tbl_adr_reg_bits=13;
parameter Fanout_hdr_tbl_wdth=13;
parameter Max_fan_out=7;
parameter Inp_bnk_size=16383;

input Reset_ctr, Rset_hit_fnd_flg, Clk;
input [Grr_mem_size:0] Grr_hit_list;

input Decrmt_enbl;

output End_scan_flag;
reg End_scan_flag;

output Hit_fnd_flag;
reg Hit_fnd_flag;

output [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;
reg [Vectr_tbl_adr_reg_bits:0] Fan_out_src_reg;

output [Max_fan_out:0] Fan_out_size_reg;
reg [Max_fan_out:0] Fan_out_size_reg;

reg [Fanout_hdr_tbl_wdth:0] Fan_out_hdr_tbl [0:Inp_bnk_size];
reg [Max_fan_out:0] Fan_out_size_tbl [0:Inp_bnk_size];
reg [Grr_mem_size:0] Hit_lst_buffr;

reg Hit_fnd_ORed_flg, Tst_or_bit, Mpl_scan_enbl;

integer Num_hits, Num_hits_ratio, Start_time, Finish_time;

reg decrmt_enbl1, decrmt_enbl2, decrmt_enbl3, decrmt_enbl4, mem_access;
reg decrmt_enbl5, decrmt_enbl6, decrmt_enbl7, decrmt_enbl8;
// ... (declarations for segments 9 to 24 are omitted from this listing)
reg decrmt_enbl25, decrmt_enbl26, decrmt_enbl27, decrmt_enbl28;
reg decrmt_enbl29, decrmt_enbl30;

/* These registers enable a segment to be scanned when asserted. This program
   assumes that the list is divided into 30 equal-sized segments. */

integer c1, c2, c3, c4, c5, c6, c7, c8;
// ... (segments 9 to 24 omitted)
integer c25, c26, c27, c28, c29, c30, Total;

reg [Vectr_tbl_adr_reg_bits:0] pos1, pos2, pos3, pos4, pos5, pos6, pos7, pos8;
// ... (segments 9 to 24 omitted)
reg [Vectr_tbl_adr_reg_bits:0] pos25, pos26, pos27, pos28, pos29, pos30;
// These are the scan registers for each segment.



parameter upr_lt1 = 149;
parameter lwr_lt1 = 0;
parameter upr_lt2 = 299;
parameter lwr_lt2 = 150;
parameter upr_lt3 = 449;
parameter lwr_lt3 = 300;
parameter upr_lt4 = 599;
parameter lwr_lt4 = 450;
parameter upr_lt5 = 749;
parameter lwr_lt5 = 600;
parameter upr_lt6 = 899;
parameter lwr_lt6 = 750;
// ... (segments 7 to 26 omitted from this listing)
parameter upr_lt27 = 4049;
parameter lwr_lt27 = 3900;
parameter upr_lt28 = 4199;
parameter lwr_lt28 = 4050;
parameter upr_lt29 = 4349;
parameter lwr_lt29 = 4200;
parameter upr_lt30 = 4392;
parameter lwr_lt30 = 4350;

/* These parameters define the upper and lower limits of the segments of the
   Group-test Hit list. */

initial
begin
  pos1=upr_lt1;
  pos2=upr_lt2;
  pos3=upr_lt3;
  pos4=upr_lt4;
  pos5=upr_lt5;
  pos6=upr_lt6;
  // ... (segments 7 to 26 omitted from this listing)
  pos27=upr_lt27;
  pos28=upr_lt28;
  pos29=upr_lt29;
  pos30=upr_lt30;

  decrmt_enbl1=1;
  decrmt_enbl2=1;
  decrmt_enbl3=1;
  decrmt_enbl4=1;
  decrmt_enbl5=1;
  decrmt_enbl6=1;
  decrmt_enbl7=1;
  // ... (segments 8 to 26 omitted)
  decrmt_enbl27=1;
  decrmt_enbl28=1;
  decrmt_enbl29=1;
  decrmt_enbl30=1;

  c1=0;
  c2=0;
  c3=0;
  c4=0;
  c5=0;
  c6=0;
  // ... (segments 7 to 26 omitted)
  c27=0;
  c28=0;
  c29=0;
  c30=0;

  mem_access=1;
end

initial $readmemh("Fanout.dat", Fan_out_hdr_tbl);

/* The file Fanout.dat contains the vectors for the start of the fan-out lists for
   every gate in the circuit being simulated. */

initial $readmemh("Fansize.dat", Fan_out_size_tbl);

/* The file Fansize.dat specifies the size of the fan-out list for each gate being
   simulated. */

initial forever
begin
  @(Reset_ctr)
  if (Reset_ctr)
  begin
    Num_hits=0;
    Hit_lst_buffr=Grr_hit_list;
    Tst_or_bit=|Grr_hit_list;
    $display("OR Check=%b", Tst_or_bit);
    End_scan_flag=0;
    Hit_fnd_flag=0;
    Hit_fnd_ORed_flg=1;

    pos1=upr_lt1;
    pos2=upr_lt2;
    pos3=upr_lt3;
    pos4=upr_lt4;
    pos5=upr_lt5;
    pos6=upr_lt6;
    // ... (segments 7 to 26 omitted from this listing)
    pos27=upr_lt27;
    pos28=upr_lt28;
    pos29=upr_lt29;
    pos30=upr_lt30;

    decrmt_enbl1=1;
    decrmt_enbl2=1;
    decrmt_enbl3=1;
    decrmt_enbl4=1;
    decrmt_enbl5=1;
    decrmt_enbl6=1;
    // ... (segments 7 to 26 omitted)
    decrmt_enbl27=1;
    decrmt_enbl28=1;
    decrmt_enbl29=1;
    decrmt_enbl30=1;

    c1=0;
    c2=0;
    c3=0;
    c4=0;
    c5=0;
    c6=0;
    // ... (segments 7 to 26 omitted)
    c27=0;
    c28=0;
    c29=0;
    c30=0;

    mem_access=1;

    $display("Initialisation seq executed");
    Start_time=$time;
  end
end

always @(posedge Decrmt_enbl)
begin
  Mpl_scan_enbl=1;
end

always @(posedge Rset_hit_fnd_flg)
begin
  Hit_fnd_flag=0;
  mem_access=1;   // release the fan-out lists to the next waiting segment
end

always @(negedge Clk)
begin
  if (!End_scan_flag)
  begin
    Hit_fnd_ORed_flg=|Hit_lst_buffr;
    if (!Hit_fnd_ORed_flg)
    begin
      End_scan_flag=1;
      Mpl_scan_enbl=0;
    end
  end

  if ((Mpl_scan_enbl) && (Hit_fnd_ORed_flg))
  begin
    // Segment 1
    if (decrmt_enbl1)
    begin
      if (Hit_lst_buffr[pos1]==1)
      begin
        Hit_lst_buffr[pos1]=0;
        decrmt_enbl1=0;
        if (!mem_access)
        begin
          c1=c1+1;            // count a clash for segment 1
          $display("Clash1 c1=%d", c1);
        end

        wait (mem_access);    // queue until the fan-out lists are free
        mem_access=0;
        Num_hits=Num_hits+1;

        Fan_out_size_reg=Fan_out_size_tbl[pos1];
        Fan_out_src_reg=Fan_out_hdr_tbl[pos1];
        Hit_fnd_flag=1;
        Hit_lst_buffr[pos1]=0;

        if (pos1>lwr_lt1)
        begin
          pos1=pos1-1;
          decrmt_enbl1=1;
        end
      end
      else
      begin
        if (pos1>lwr_lt1)
          pos1=pos1-1;
        else
          decrmt_enbl1=0;
      end
    end

    // ... (segments 2 to 29 follow the same pattern; omitted from this listing)

    // Segment 30
    if (decrmt_enbl30)
    begin
      if (Hit_lst_buffr[pos30]==1)
      begin
        Hit_lst_buffr[pos30]=0;
        decrmt_enbl30=0;
        if (!mem_access)
        begin
          c30=c30+1;
          $display("Clash30 c30=%d", c30);
        end

        wait (mem_access);
        mem_access=0;
        Num_hits=Num_hits+1;

        Fan_out_size_reg=Fan_out_size_tbl[pos30];
        Fan_out_src_reg=Fan_out_hdr_tbl[pos30];
        Hit_fnd_flag=1;
        Hit_lst_buffr[pos30]=0;

        if (pos30>lwr_lt30)
        begin
          pos30=pos30-1;
          decrmt_enbl30=1;
        end
      end
      else
      begin
        if (pos30>lwr_lt30)
          pos30=pos30-1;
        else
          decrmt_enbl30=0;
      end
    end
  end
end



always @(posedge End_scan_flag)
begin
  Finish_time=$time;
end

endmodule
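Two aspects of the Multiple Scan Mode resolver are easy to model in Python: how the segment limits lwr_ltN/upr_ltN are laid out, and roughly how often scan registers collide. The `segments` function reproduces the listed parameters for a top address of 4392 and a segment size of 150. The clash count is only a first-order estimate that assumes lockstep scanning: the Verilog actually stalls a clashing register on `wait (mem_access)`, which shifts its later hits, and that knock-on effect is ignored here. Both function names are hypothetical.

```python
def segments(top, size):
    """(lower, upper) bounds of each scan segment; the last one is clipped at `top`."""
    return [(lo, min(lo + size - 1, top)) for lo in range(0, top + 1, size)]


def lockstep_clashes(hit_list, segs):
    """Assume every scan register steps down one bit per cycle without stalling,
    and count the extra registers that find a hit in the same cycle."""
    longest = max(upr - lwr + 1 for lwr, upr in segs)
    clashes = 0
    for t in range(longest):
        hits = sum(1 for lwr, upr in segs
                   if upr - t >= lwr and hit_list[upr - t])
        clashes += max(hits - 1, 0)
    return clashes


if __name__ == "__main__":
    segs = segments(4392, 150)
    print(len(segs), segs[0], segs[-1])
```

Equal-sized segments keep every register busy for the same number of cycles, so the scan time falls roughly in proportion to the number of registers until clashes on the shared fan-out port dominate.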



Fan-out Generator module 

Description: When a hit has been detected in the Group-test Hit list, the address within the scan
register is used to fetch a vector from the Fan_out_hdr table, which gives the start of a fan-out list
for the current active gate. The address register of this module is loaded with the address of the
header of the fan-out list. The size of this fan-out list and the updated signal value to be
transmitted are also conveyed to the module. The module then proceeds to effect all the changes
to the fan-out list.

module Fan_out_gen (Fan_out_load, Fan_out_gen_flg, Reset_gen, Update_val_in,
                    Clock, Update_val_out, Fan_out_size_reg,
                    Fan_out_adr_reg, Out_adr_reg);

/* The address of the start of the fan-out list in the Fan-out vector table and
   the number of fan-out elements are contained in Fan_out_adr_reg and
   Fan_out_size_reg, respectively. These are loaded on the positive edge of
   Fan_out_load. On each successive negative edge of Clock, the vector of the next
   fan-out element is generated in Out_adr_reg. Fan_out_gen_flg remains set while
   the fan-out list is being generated and is cleared at its end; it is also
   cleared when Reset_gen is set. The signal value to be conveyed to the fan-out
   list is transferred from Update_val_in to Update_val_out. */

parameter Vectr_tbl_wrd_size=13;
parameter Vectr_tbl_size=16383;
parameter Inp_val_wdth=2;
parameter Max_fan_out=7;
parameter Vectr_tbl_adr_size=13;

input Fan_out_load, Reset_gen, Clock;
input [Inp_val_wdth:0] Update_val_in;
input [Max_fan_out:0] Fan_out_size_reg;
input [Vectr_tbl_adr_size:0] Fan_out_adr_reg;

output Fan_out_gen_flg;
reg Fan_out_gen_flg;

output [Inp_val_wdth:0] Update_val_out;
reg [Inp_val_wdth:0] Update_val_out;

output [Vectr_tbl_wrd_size:0] Out_adr_reg;
reg [Vectr_tbl_wrd_size:0] Out_adr_reg;

reg [Vectr_tbl_wrd_size:0] Fan_out_vector_tbl [0:Vectr_tbl_size];
reg [Vectr_tbl_wrd_size:0] List_pos;
reg [Max_fan_out:0] Counter;

initial $readmemh("Fanvcr.dat", Fan_out_vector_tbl);
/* Fanvcr.dat contains the vectors of the signals in the fan-out lists for every
   gate. */

initial forever
begin
  @(Reset_gen)
  if (Reset_gen)
  begin
    Fan_out_gen_flg=0;
  end
end

always @(posedge Fan_out_load)
begin
  if (!Reset_gen)
  begin
    Counter=Fan_out_size_reg;
    List_pos=Fan_out_adr_reg;
    Update_val_out=Update_val_in;
    Fan_out_gen_flg=1;
  end
end

always @(negedge Clock)
begin
  if (!Reset_gen && Fan_out_gen_flg)
  begin
    if (Counter!=0)
    begin
      // Emit one fan-out element per clock.
      Out_adr_reg=Fan_out_vector_tbl[List_pos];
      List_pos=List_pos+1;
      Counter=Counter-1;
    end
    else
      Fan_out_gen_flg=0;
  end
end

endmodule
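A behavioural model of Fan_out_gen in Python, under the assumption that the header, size and vector tables are plain lists indexed the way Fanout.dat, Fansize.dat and Fanvcr.dat are used above. The function names and any table contents are hypothetical; the sketch reproduces only the order of the emitted addresses, one per clock in the Verilog.

```python
def fan_out_addresses(gate, hdr_tbl, size_tbl, vector_tbl):
    """Starting at the header vector for `gate`, return size_tbl[gate] consecutive
    entries of the vector table (the successive values of Out_adr_reg)."""
    start = hdr_tbl[gate]     # Fan_out_adr_reg
    count = size_tbl[gate]    # Fan_out_size_reg, loaded into Counter
    return [vector_tbl[start + i] for i in range(count)]


def fan_out_updates(gate, new_value, hdr_tbl, size_tbl, vector_tbl):
    """Pair every fan-out address with the value carried on Update_val_out."""
    return [(addr, new_value)
            for addr in fan_out_addresses(gate, hdr_tbl, size_tbl, vector_tbl)]


if __name__ == "__main__":
    hdr, size, vectors = [0, 2, 3], [2, 1, 0], [5, 7, 9]  # made-up tables
    print(fan_out_updates(0, 1, hdr, size, vectors))
```

Storing each gate's fan-out as a contiguous run addressed by a header vector lets the generator step with a simple increment and a down-counter, as the Verilog does.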
Input-value Bank

Description: The bank contains the current values of all the signals in the circuit. Each location
in the bank corresponds to a wire. Since a word at any location is 3 bits wide, up to 8-valued
logic can be simulated (this can be augmented by increasing the word width). The current value
of any wire is shifted from this bank into Array1b when time is incremented. This is done in
parallel. Only wire values that have changed in the current time interval are updated.

module Input_val_bank (Inp_val_reg, Adr_reg, Clock, Shft_enbl, Wrt_enbl,
                       Out_buffr_reg);

/* Inp_val_reg contains the new value of a signal (i.e. word) in Inp_val_ary. The
   location of the wire is specified in Adr_reg, and the write operation is executed
   on the negative edge of Clock if Wrt_enbl is asserted. If Shft_enbl is asserted,
   the right-most bit of every location is shifted into the 1-bit column register
   Out_buffr_reg on the positive edge of Clock. The shifted bits are also written
   into the left-most bit of Inp_val_ary (i.e. a rotation), so the values are
   retained once all their bits have been shifted out. */

parameter Inp_val_wdth=2;
parameter Adr_reg_bits=13;
parameter Inp_bnk_size=16383;
parameter Lsr7552_Inp_bnk_size=8784;

input Clock, Shft_enbl, Wrt_enbl;
input [Inp_val_wdth:0] Inp_val_reg;
input [Adr_reg_bits:0] Adr_reg;

output [Inp_bnk_size:0] Out_buffr_reg;
reg [Inp_bnk_size:0] Out_buffr_reg;

reg [Inp_val_wdth:0] Inp_val_ary [0:Inp_bnk_size];

reg [Inp_val_wdth:0] Temp_reg;
reg Temp_bit;

integer Inp_ary_indx;

initial $readmemb("Inpval.dat", Inp_val_ary);

/* Inpval.dat is the file which initialises the current input values of all gates
   in the simulated circuit. All values are assigned 'Unknown' logic values except
   those primary inputs which are assigned logic '0' or '1'. */

always @(posedge Clock)
begin
  if (Shft_enbl)
  begin
    // The loop bound Lsr7552_Inp_bnk_size limits the rotation to the locations in use.
    for (Inp_ary_indx=0; Inp_ary_indx<=Lsr7552_Inp_bnk_size;
         Inp_ary_indx=Inp_ary_indx+1)
    begin
      Temp_reg=Inp_val_ary[Inp_ary_indx];
      Temp_bit=Temp_reg[0];
      Out_buffr_reg[Inp_ary_indx]=Temp_bit;                // right-most bit out
      Temp_reg[Inp_val_wdth-1:0]=Temp_reg[Inp_val_wdth:1];
      Temp_reg[Inp_val_wdth]=Temp_bit;                     // rotated into left-most bit
      Inp_val_ary[Inp_ary_indx]=Temp_reg;
    end

    $display("(shft) time=%d", $time);
  end
  else if (Wrt_enbl)
  begin
    Inp_val_ary[Adr_reg]=Inp_val_reg;
  end
end

endmodule
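The rotation performed under Shft_enbl can be checked with this Python sketch (hypothetical name `rotate_bank`), assuming the 3-bit words given by Inp_val_wdth = 2. After three shifts every word returns to its original value, which is how the bank streams its values out bit by bit without losing them.

```python
WIDTH = 3  # Inp_val_wdth = 2, so each location holds 3 bits


def rotate_bank(values, width=WIDTH):
    """Shft_enbl branch: the right-most bit of every location goes to the output
    column and is rotated back into the left-most bit of that location."""
    out_column, new_values = [], []
    for v in values:
        bit = v & 1
        out_column.append(bit)
        new_values.append((v >> 1) | (bit << (width - 1)))
    return out_column, new_values


if __name__ == "__main__":
    vals = [0b101, 0b011]
    for _ in range(WIDTH):
        column, vals = rotate_bank(vals)
        print(column, [format(v, "03b") for v in vals])
```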

The Sequence Logic of the APPLES Processor 

parameter Nibl=3;
parameter Ary_la_wdth=7;
parameter Ary_lb_adr_reg_wdth=13;
parameter Ary_la_size=16383;
parameter Ary_lb_size=16383;
parameter Eval_ptrn_tbl_size=63;
parameter Eval_ptrn_vctr_tbl_size=31;
parameter Num_tst_wdth=7;
parameter Num_tst_ptrn_tbl_size=31;
parameter Gate_maskla_tbl_size=31;
parameter Gate_inptla_tbl_size=31;
parameter Trr_ptrn_tbl_size=31;
parameter Grr_ptrn_tbl_size=31;
parameter Out_val_tbl_size=31;

parameter Wlr_wrdsize=31;
parameter Trr_wdth_spec=2;
parameter Trr_word_size=7;
parameter Grr_mem_size=8191;
parameter Grr_wdth_spec=2;
parameter Grr_word_size=7;
parameter Iu_word_size=7;
parameter Iu_wdth_spec=2;
parameter Vectr_tbl_adr_reg=13;
parameter Max_fan_out=7;
parameter Inp_val_wdth=2;
parameter Vectr_tbl_adr_size=16383;

parameter Index_reg_wdth=7;

parameter Num_tst_seq=12;  // No. of gates x No. of transitions
parameter Num_tst_cnt_wdth=3;
parameter Init_shft_val=3;
parameter Shft_cnt_wdth=3;

wire Clock;

wire [Ary_la_size:0] Wrd_ln_activ_lst, Trr_bnk_inp_reg;

wire [Ary_lb_size:0] Inval_unit_out_reg;

wire [Grr_mem_size:0] Grr_bnk_inp_reg, Grr_bnk_hit_lst;

wire [Max_fan_out:0] Mrr_unit_fan_out_size_reg;

wire [Vectr_tbl_adr_reg:0] Mrr_unit_fan_out_src_reg;

wire [Inp_val_wdth:0] Fo_gen_unit_val_out;

wire [Vectr_tbl_adr_size:0] Fo_gen_unit_out_adr_reg;

reg Tst_seq_strt;

reg e0, e1, e2, e3, e4, e5, e6, e7, e8, e9, e10, e11, e12, e13, e14,
    e15, e16, e16a, e16b, e17, e18, e19, e20, e21, e22, e23, e24, e25, e26, e27, e28, e29,
    Deact_srchla, Gate_eval_init_proc;

reg [Index_reg_wdth:0] Ept_i, Epvt_i, Ntpt_i, Gmlat_i, Gilat_i,
                       Tpt_i, Grit_i, Grmt_i, Ovt_i;

reg [Wlr_wrdsize:0] Eval_ptrn_tbl [0:Eval_ptrn_tbl_size];
reg [Wlr_wrdsize:0] Eval_ptrn_vctr_tbl [0:Eval_ptrn_vctr_tbl_size];
reg [Num_tst_wdth:0] Num_tst_ptrn_tbl [0:Num_tst_ptrn_tbl_size];
reg [Ary_la_wdth:0] Gate_maskla_tbl [0:Gate_maskla_tbl_size];
reg [Ary_la_wdth:0] Gate_inptla_tbl [0:Gate_inptla_tbl_size];
reg [Trr_word_size:0] Trr_ptrn_tbl [0:Trr_ptrn_tbl_size];
reg [Grr_word_size:0] Grr_inpt_tbl [0:Grr_ptrn_tbl_size];
reg [Grr_word_size:0] Grr_mask_tbl [0:Grr_ptrn_tbl_size];
reg [Inp_val_wdth:0] Out_val_tbl [0:Out_val_tbl_size];

reg tGrr_word_size : 0 ) Grr_bnk_search_reg, Grr_bnk_mask_reg ; 

reg { Grr_wdth_spec : 0 ] Grr_bnk_wr t _pos ; 

reg [ Trr_wdth_spec : 0 J Trr_bnk_wrt _pos 

reg[Trr_word_size : 01 Trr.rslt.act.reg.Trr.rslt.act.and.O; 
reg [ Iu_word_size : 0 ] Inval_unit_adr_reg ; 

reg [ Iu_wdth_spec : 0 ] Fo_gen_unit_val_in , Inval_unit_in_reg ,- 

Inval_unrt_shft_enbl. I nval_unit_ W rt_e^bl; ^-^"-"et. 
reg[Ary_la_wdth:0) Inp_regla, Mask_regla. Adr_regla; 
reg[Wlr_wrdsize: 0] Inp_reg_lb, Search_reg_lb,Mask_reg_lb; 
r-eg [Ary_lb_adx_reg_wdth : O ] Adr_reg_lb; 
reg[Num_tst_cnt_wdth:01 Num_ts t_cnt ; 
reg[Shft_cnt_wdth:01 Shft_cnt; 

Ary_la Gate_id_bnk (Inp_regla, Mask_regla, Adr_regla, Clock,
Search_ary_la, Write_enbl_la, Wrd_ln_activ_lst);

Ary_lb Wrd_ln_reg_bnk (Search_reg_lb, Mask_reg_lb, Adr_reg_lb,
Inp_reg_lb, Out_reg_lb, Trr_bnk_inp_reg, Shft_ary_lb,
Clock, Inval_unit_out_reg, Wrd_ln_activ_lst);

Tst_rslt_reg_bank Trr_bnk (Trr_bnk_inp_reg, Trr_bnk_wrt_enbl, Trr_bnk_comb_enbl,
Clock, Grr_bnk_inp_reg, Trr_rslt_act_reg,
Trr_bnk_wrt_pos, Trr_bnk_rset);

Grp_rslt_reg_bank Grr_bnk (Grr_bnk_inp_reg, Grr_bnk_mask_reg,
Grr_bnk_search_reg, Clock, Grr_bnk_search_enbl,
Grr_bnk_wrt_enbl, Grr_bnk_wrt_pos, Grr_bnk_hit_lst);

Multiple_res_res Mrr_unit (Grr_bnk_hit_lst, Clock, Mrr_unit_rset,
Mrr_unit_end_scan_flg, Mrr_unit_decrmt_enbl,
Mrr_unit_fan_out_src_reg,
Mrr_unit_fan_out_size_reg,
Mrr_unit_rset_hit_fnd_flg,
Mrr_unit_hit_fnd_flag);

Fan_out_gen Fo_gen_unit (Fo_gen_unit_rset,
Fo_gen_unit_val_in, Clock, Fo_gen_unit_val_out,
Mrr_unit_fan_out_size_reg, Mrr_unit_fan_out_src_reg);

Input_val_bank Inval_unit (Fo_gen_unit_val_out, Fo_gen_unit_out_adr_reg, Clock,
Inval_unit_shft_enbl, Inval_unit_wrt_enbl,
Inval_unit_out_reg);

Ck_gen Clk_unit (Clock);



integer i, Tst_num, iter_cnt;

initial
begin

$display("Initialisation commencing.");
$readmemb("Ep_tbl.dat", Eval_ptrn_tbl);
$display("Ep_tbl.dat loaded.");
$readmemh("Epv_tbl.dat", Eval_ptrn_vctr_tbl);
$display("Epv_tbl.dat loaded.");
$readmemh("Ntp_tbl.dat", Num_tst_ptrn_tbl);
$display("Ntp_tbl.dat loaded.");
$readmemb("Gila_tbl.dat", Gate_inptla_tbl);
$display("Gila_tbl.dat loaded.");
$readmemb("Gmla_tbl.dat", Gate_maskla_tbl);
$display("Gmla_tbl.dat loaded.");
$readmemb("Tp_tbl.dat", Trr_ptrn_tbl);
$display("Tp_tbl.dat loaded.");
$readmemb("Gi_tbl.dat", Grr_inpt_tbl);
$display("Gi_tbl.dat loaded.");
$readmemb("Gm_tbl.dat", Grr_mask_tbl);
$display("Gm_tbl.dat loaded.");
$readmemb("Ov_tbl.dat", Out_val_tbl);
$display("Ov_tbl.dat loaded.");

$display("Table initialisation sequence completed");

Gate_eval_init_proc=1;
iter_cnt=0;

Num_tst_cnt=Num_tst_seq;
Inval_unit_shft_enbl=0;

Ept_i=8'h00; Epvt_i=8'h00; Ntpt_i=8'h00;
Gmlat_i=8'h00; Gilat_i=8'h00; Tpt_i=8'h00;
Grit_i=8'h00; Grmt_i=8'h00; Ovt_i=8'h00;
end

always @(negedge Clock)
if (Gate_eval_init_proc)
begin
$display("Gate_eval_init_proc at time=%d", $time);
iter_cnt=iter_cnt+1;

$display("Iteration count=%d", iter_cnt);

Gate_eval_init_proc=0;

Deact_srchla=0;

e0=0; e1=0; e2=0; e3=0; e4=0; e5=0; e6=0;
e7=0; e8=0; e9=0; e10=0; e11=0; e12=0; e13=0;
e14=0; e15=0; e16=0; e16a=0; e16b=0; e17=0;
e18=0; e19=0; e20=0; e21=0; e22=0;

Inp_regla=Gate_inptla_tbl[Gilat_i];
Mask_regla=Gate_maskla_tbl[Gmlat_i];
Tst_num=Num_tst_ptrn_tbl[Ntpt_i];
Ept_i=Eval_ptrn_vctr_tbl[Epvt_i];
Mrr_unit_decrmt_enbl=0;
Tst_seq_strt=1;
Wlr_bnk_search_enbl=0;
Inval_unit_wrt_enbl=0;
end

always @ (posedge Clock) 
begin 

if (Tst_seq_strt) 
begin 

Trr_bnk_rset=1;
Search_ary_la=1;
e0=1;

Tst_seq_strt=0;
end 
end 



always @(negedge Clock)
begin
if (e0)
begin
e0=0;

Deact_srchla=1;
end
end



always @(posedge Clock)
begin

if (Deact_srchla)
begin

Trr_bnk_rset=0;
Deact_srchla=0;
Search_ary_la=0;
e1=1;

i=Trr_word_size;
end 
end 



always @(negedge Clock)
begin
if (e1)
begin

e1=0;

e2=1;

end 
end 



always @(posedge Clock)
begin
if (e2)
begin

Wlr_bnk_search_enbl=1;
Search_reg_lb=Eval_ptrn_tbl[Ept_i];
Mask_reg_lb=Eval_ptrn_tbl[Ept_i+1];
e2=0;

e3=1;
end 
end 

always @(negedge Clock)
begin
if (e3)
begin

e3=0;

e4=1;

end 
end 



always @(posedge Clock) 
begin 
if (e4)
begin

Trr_bnk_wrt_enbl=1;

Trr_bnk_wrt_pos=i;

Wlr_bnk_search_enbl=0;
e4=0;

e5=1;

end 

end 
always @(negedge Clock)
begin
if (e5)
begin

e5=0;

e6=1;

end 
end 



always @(posedge Clock) 
begin 
if (e6) 
begin 

Tst_num=Tst_num-1;
i=i-1;
e6=0;

if (Tst_num>0)
begin
e1=1;

Ept_i=Ept_i+2;

$display("Ept_i (updated)=%d", Ept_i);
Trr_bnk_wrt_enbl=0;
end
else

begin

Trr_bnk_wrt_enbl=0;
i=Trr_word_size;

Trr_rslt_act_reg=Trr_ptrn_tbl[Tpt_i];
Tst_num=Num_tst_ptrn_tbl[Ntpt_i];
e7=1;
end 

end 
end 

always @ (negedge Clock) 
begin 
if (e7) 

begin 

e7=0; 

e8=1;

end 
end 



always @ (posedge Clock) 
begin 
if (e8)
begin

Trr_bnk_comb_enbl=1;
Trr_bnk_wrt_pos=i;
e8=0;
e9=1;

$display("Commencement of TRR tests for Gate type=%b at time=%d", Inp_regla, $time);

end 
end 

always @ (negedge Clock) 
begin 
if (e9)
begin

e9=0;

e10=1;

end 
end 
always @(posedge Clock)
begin
if (e10)
begin

Trr_bnk_comb_enbl=0;

Grr_bnk_wrt_enbl=1;

Grr_bnk_wrt_pos=i;

e10=0;

e11=1;

end 

end 

always @(negedge Clock) 
begin 
if (e11)
begin

e11=0;
e12=1;

end 
end 

always @ (posedge Clock) 
begin 
if (e12)
begin

Tst_num=Tst_num-1;

i=i-1;

e12=0;

if (Tst_num>0)
begin
e9=1;

Trr_bnk_comb_enbl=1;
Trr_bnk_wrt_pos=i;
Grr_bnk_wrt_enbl=0;
end
else
begin
e13=1;

Grr_bnk_wrt_enbl=0;
end 

end 
end 

always @ (negedge Clock) 
begin 
if (e13)
begin

e13=0;

e14=1;

$display("Termination of TRR tests for Gate type=%b at time=%d", Inp_regla, $time);

end 
end 



always @ (posedge Clock) 
begin 
if (e14)
begin

Grr_bnk_search_reg=Grr_inpt_tbl[Grit_i];

Grr_bnk_mask_reg=Grr_mask_tbl[Grmt_i];

Grr_bnk_search_enbl=1;

Fo_gen_unit_rset=1;

e14=0;

e15=1;

end 

end 
always @(negedge Clock)
begin
if (e15)
begin

e15=0;

e16=1;

end 
end 

always @ (posedge Clock) 
begin 
if (e16)
begin

Mrr_unit_rset=1;
e16=0;
e16a=1;
end 
end 

always @ (negedge Clock) 
begin 
if (e16a)
begin

Mrr_unit_rset=0;
e16a=0;
e16b=1;
end 
end 

// Propagate values to gates affected in fan_out lists 

always @ (posedge Clock) 
begin 
if (e16b)
begin

Grr_bnk_search_enbl=0;
Mrr_unit_decrmt_enbl=1;
Fo_gen_unit_rset=0;

Fo_gen_unit_val_in=Out_val_tbl[Ovt_i];

e16b=0;

e17=1;

$display("Start of fanout list at time=%d", $time);
end 

end 

always @(negedge Clock)
begin
if (e17)
begin

Fo_gen_unit_load=0;
e17=0;
e18=1;
end 
end 



always @ (posedge Clock) 
begin 
if (e18)
begin

if (Mrr_unit_hit_fnd_flag)
begin

Fo_gen_unit_load=1;

e18=0;

e19=1;
end
else

if ((!Mrr_unit_hit_fnd_flag) && Mrr_unit_end_scan_flg)
begin
e18=0;
e22=1;

Mrr_unit_decrmt_enbl=0;
end 



end 

end 

always @(negedge Clock)
begin
if (e19)
begin
Fo_gen_unit_load=0;
Inval_unit_wrt_enbl=1;
Mrr_unit_rset_hit_fnd_flg=0;
e19=0;
e20=1;
end 
end 

always @ (posedge Clock) 
begin 

if (e20) 
begin 

if (!Fo_gen_unit_flg)
begin

if (!Mrr_unit_end_scan_flg)
begin

Mrr_unit_rset_hit_fnd_flg=1;
Inval_unit_wrt_enbl=0;
e20=0;
e21=1;
end

else

begin

Inval_unit_wrt_enbl=0;
e20=0;

e22=1;

end 

end 

end 

end 

always @(negedge Clock) 
begin 
if (e21)
begin
e18=1;
e21=0; 
end 
end 

always @ (negedge Clock) 
begin 
if (e22) 
begin 
e22=0; 
e23=1;

Epvt_i=Epvt_i+1; Ntpt_i=Ntpt_i+1;

Gmlat_i=Gmlat_i+1; Gilat_i=Gilat_i+1;
Tpt_i=Tpt_i+1;

Grit_i=Grit_i+1; Grmt_i=Grmt_i+1;
Ovt_i=Ovt_i+1;

$display("Termination of fan-out update, time=%d", $time);
end 
end 

always @(posedge Clock) 
begin 
if (e23) 
begin 
e23=0; 

Num_tst_cnt=Num_tst_cnt-l ; 

if (Num_tst_cnt==0)

begin 

e24=l; 

end 
else 

Gate_eval_init_proc=l ; 
end 
end 

always @(negedge Clock)
begin
if (e24)
begin

$display("E24 attained. End of fanout update.");

$display("");

Inval_unit_shft_enbl=1;
Shft_cnt=Init_shft_val;
e24=0;
e25=1;
end 
end 

// Input_val_bank is +ve edge triggered. Thus next block is -ve edge.

always @ (posedge Clock) 
begin 
if (e25) 
begin 

$display("E25 attained");
Shft_ary_lb=1;
e25=0;
e26=1;
end 

end 

always @(negedge Clock) 
begin 
if (e26) 
begin 

$display("E26 attained");
Shft_cnt=Shft_cnt-1;
if (Shft_cnt==0)
begin
e26=0;

Inval_unit_shft_enbl=0;
e27=1;
end 

end 
end 

always @(posedge Clock)
begin
if (e27)
begin
Shft_ary_lb=0;
e27=0;
e28=1;
end 

end 

always @(negedge Clock)
begin
if (e28)
begin
e28=0;
e29=1;
end 
end 

always @(posedge Clock)
begin
if (e29)
begin

Gate_eval_init_proc=1;
Num_tst_cnt=Num_tst_seq;

Ept_i=8'h00; Epvt_i=8'h00; Ntpt_i=8'h00;
Gmlat_i=8'h00; Gilat_i=8'h00; Tpt_i=8'h00;
Grit_i=8'h00; Grmt_i=8'h00; Ovt_i=8'h00;
e29=0;

end 
end 



endmodule 
The APPLES architecture is designed to provide a fast and flexible mechanism for
logic simulation. Applying test patterns to an associative memory results in
fixed-time gate processing and a flexible delay model. Multiple scan registers
provide an effective way of parallelising the fan-out updating procedure. This
mechanism eliminates the need for conventional parallel techniques such as load
balancing and deadlock avoidance or recovery. Consequently, parallel overheads
are reduced. As more scan registers are introduced, the gate evaluation rate
increases, ultimately being limited by the average fan-out list size per gate
and consequently by the memory bandwidth of the fan-out list memory.
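The parallel scan can be illustrated with a small Python model (not the Verilog used for APPLES). It assumes the hit list is split into equal contiguous segments, one per scan register, and that each register emits one hit address per cycle. Fan-out memory bandwidth is ignored. The function name is hypothetical.

```python
from typing import List


def scan_segmented_hit_list(hit_list: List[int], num_scan_registers: int) -> List[List[int]]:
    """Return, for each simulated cycle, the hit addresses emitted by all scan registers.

    The hit list is split into equal contiguous segments, one per scan register.
    In each cycle every register with hits remaining emits its lowest remaining
    hit address, so all registers work in parallel.
    """
    n = len(hit_list)
    seg_len = max(1, -(-n // num_scan_registers))  # ceiling division, at least 1
    # Pending hit addresses for each segment, in ascending address order.
    segments = [
        [addr for addr in range(start, min(start + seg_len, n)) if hit_list[addr]]
        for start in range(0, n, seg_len)
    ]
    cycles = []
    while any(segments):
        cycles.append([seg.pop(0) for seg in segments if seg])
    return cycles


hits = [1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(len(scan_segmented_hit_list(hits, 1)))  # one register: 6 cycles
print(scan_segmented_hit_list(hits, 4))       # [[0, 7, 8], [2, 9], [3]]
```

With four registers, the same six hits are transferred in three cycles instead of six. The segment holding the most hits sets the cycle count.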

Referring to Fig. 8, there is illustrated an array indicated generally by the reference
numeral 20 comprising a plurality of cells 21, each of which comprises an APPLES
processor as described above. A synchronisation logic control 22 is provided. The
circuit that is to be simulated is split up among the APPLES processors. Gate
evaluations are carried out independently in each processor or cell 21. Each cell
21 is provided with a local input value register bank and a foreign input value
register bank to allow interconnection, which is done through an interconnecting
network 23 incorporating the synchronisation logic 22. The connections between the
synchronisation logic circuit 22 (strictly speaking, the main synchronisation logic
circuit) and each of the cells 21 are not shown.

After all gate evaluations for all gate types and the corresponding updates have
occurred on a given processor forming a cell 21, the processor must wait for all
other processors to reach the same state. When all processors reach this state,
the respective input value register banks can be shifted into the respective
array and associative register 1b, and evaluation of the next time unit can occur.
Implementation therefore requires a suitable interconnecting network and an
interface to the APPLES processor. A synchronisation method must exist to
determine when evaluation of the next time unit should proceed. A system to split
the hit list information amongst the processors is also required in order to
initialise the system.

The array of processors is implemented as a torus (equivalent to a 2D mesh with
wrap-around), as shown in Fig. 8. The inclusion of wrap-around connections
reduces the network diameter, increasing the network speed. It also means that
each processor can be identical, without wasted hardware at the edges of the array.
It does, however, require a more complicated routing mechanism. No fixed size was
used for the array; instead, the size was treated as a parameter that was varied
during simulations. It was specified by a command line parameter to the
Verilog compiler. These command line parameters are covered in detail in the next
chapter.
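A small brute-force calculation can check the effect of wrap-around on network diameter. The Python sketch below computes the largest minimal hop count between any two cells of a width x height array, with and without wrap-around links. The function names are hypothetical.

```python
def ring_hops(a: int, b: int, size: int, wrap: bool) -> int:
    """Hops between positions a and b along one dimension of the array."""
    d = abs(a - b)
    return min(d, size - d) if wrap else d


def diameter(width: int, height: int, wrap: bool) -> int:
    """Largest minimal hop count between any pair of cells (Manhattan distance)."""
    return max(
        ring_hops(x1, x2, width, wrap) + ring_hops(y1, y2, height, wrap)
        for x1 in range(width) for x2 in range(width)
        for y1 in range(height) for y2 in range(height)
    )


print(diameter(8, 8, wrap=False))  # 2D mesh: 14
print(diameter(8, 8, wrap=True))   # torus: 8
```

For the 8 x 8 array discussed below, the wrap-around links reduce the worst-case route from 14 hops to 8.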

Each cell is connected to its four neighbouring cells via serial connections.
Parallel connections would obviously be faster. However, a Virtex FPGA was used,
and it has a limited number of pins. Owing to the FPGA architecture, not all of
these pins may be available to a particular design. Pins are therefore a
precious resource. Since each FPGA would require eight parallel connections (an
input and an output connection on each of the four edges), a parallel network
would require a large number of pins. If at a later stage it is discovered that
there are spare pins and a parallel network is justified, the design could be
altered. In this design each cell has a serial input and a serial output on each
of its four edges. These serial connections each consist of a data line and two
control lines. These serial connections will therefore require 12 pins on each
Virtex FPGA. Each cell is also connected to the array's synchronisation logic.

Designing the network requires knowledge of the information that it must carry.
The network is required in order to pass fan out updates between processors.
These updates can be passed as messages. Each message is an update and consists
of a destination address and an update value. A single Virtex FPGA was used to
implement an APPLES processor capable of simulating a circuit with approximately
256 gates. This figure is somewhat arbitrary, and further design work will reveal
the true value required. Given a constraint of 256 gates per processor,
approximately 64 processors would be required to simulate a reasonably complex
circuit. This corresponds to an 8 x 8 array. Each processor will need to be able
to send updates to any other processor, updating any one of the destination
processor's 512 gate inputs. This implies an address space of six bits to
identify the processor (2^6 = 64) and nine bits to identify the wire (2^9 = 512).
Each update sent also requires an update value. These values are three bits wide
(enabling support for eight-state logic). Therefore, messages sent from processor
to processor will need to be eighteen bits wide (6 + 9 + 3). These figures are
arbitrary but are a useful starting point.
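The 6 + 9 + 3 bit budget can be made concrete with a Python sketch of packing and unpacking one update message. The field order shown here (processor in the top bits, then wire, then value) is an assumption for illustration, because the text gives only the field widths. The function names are hypothetical.

```python
PROC_BITS = 6    # 64 processors (8 x 8 array)
WIRE_BITS = 9    # 512 gate inputs per processor
VALUE_BITS = 3   # eight-state logic value
MSG_BITS = PROC_BITS + WIRE_BITS + VALUE_BITS  # 18


def pack_update(proc: int, wire: int, value: int) -> int:
    """Pack one fan-out update as [processor | wire | value], processor in the top bits."""
    if not (0 <= proc < 1 << PROC_BITS):
        raise ValueError("processor address out of range")
    if not (0 <= wire < 1 << WIRE_BITS):
        raise ValueError("wire address out of range")
    if not (0 <= value < 1 << VALUE_BITS):
        raise ValueError("update value out of range")
    return (proc << (WIRE_BITS + VALUE_BITS)) | (wire << VALUE_BITS) | value


def unpack_update(msg: int) -> tuple:
    """Inverse of pack_update: recover (processor, wire, value)."""
    value = msg & ((1 << VALUE_BITS) - 1)
    wire = (msg >> VALUE_BITS) & ((1 << WIRE_BITS) - 1)
    proc = msg >> (WIRE_BITS + VALUE_BITS)
    return proc, wire, value


print(MSG_BITS)                                # 18
print(unpack_update(pack_update(5, 300, 2)))   # (5, 300, 2)
```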

The structure of a cell 21 is shown in Fig. 9. Each of the four edges has a
transmitter 25 and a receiver 26. These modules deal with the serial connections.
The transmitter 25 takes in an eighteen-bit entity and sends it out as a bit stream.
The receiver 26 takes in the bit stream and reconstitutes it into the original
eighteen-bit message.
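The transmitter and receiver pair can be sketched in Python as a serialise and deserialise round trip on the data line. The sketch assumes bits are sent most significant bit first, one per clock. The roles of the two control lines are not described in the text, so they are left out. The function names are hypothetical.

```python
from typing import List

MSG_BITS = 18  # message width from the network design above


def serialise(msg: int, width: int = MSG_BITS) -> List[int]:
    """Transmitter side: the bits placed on the data line, MSB first (assumed order)."""
    if not 0 <= msg < 1 << width:
        raise ValueError("message does not fit in the given width")
    return [(msg >> (width - 1 - k)) & 1 for k in range(width)]


def deserialise(bits: List[int]) -> int:
    """Receiver side: shift the incoming bits back into a single word."""
    msg = 0
    for b in bits:
        msg = (msg << 1) | b
    return msg


print(deserialise(serialise(0x2ABCD)) == 0x2ABCD)  # True
```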

A request scanner 27 checks every receiver 26 and the APPLES processor 30
simultaneously to see whether they have messages waiting to be routed. It assigns
each of these sources a rotating priority and picks the source that has a message
and the highest priority. It then passes the picked message to a request router 28.
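The rotating priority can be illustrated as a round-robin arbiter in Python. The text states only that priority rotates. This sketch assumes that after each pick, the source just after the picked one becomes the highest priority. The class name is hypothetical.

```python
from typing import List, Optional


class RequestScanner:
    """Rotating-priority selection among message sources.

    Sources 0..n-1 might be the four receivers and the local APPLES processor.
    After each pick, the source just after the picked one becomes highest
    priority (an assumed rotation rule).
    """

    def __init__(self, num_sources: int) -> None:
        self.num_sources = num_sources
        self.highest = 0  # index of the source currently holding top priority

    def pick(self, has_message: List[bool]) -> Optional[int]:
        """Return the highest-priority source with a waiting message, or None."""
        for offset in range(self.num_sources):
            src = (self.highest + offset) % self.num_sources
            if has_message[src]:
                self.highest = (src + 1) % self.num_sources
                return src
        return None


scanner = RequestScanner(5)
print([scanner.pick([True] * 5) for _ in range(6)])  # [0, 1, 2, 3, 4, 0]
```

When every source is busy, each one is served in turn, so no receiver can starve the others.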

The request router 28 passes its messages either to the APPLES processor 30 or
to a transmitter 25. If a transmitter is chosen, the message will be sent to a
different cell 21. If the APPLES processor 30 is chosen, the message is an update
for the local processor. A synchronisation logic circuit 31 controls the cell 21
through the synchronisation logic circuit 22.

In Fig. 9, every transmitter, every receiver and the input and output ports of the
APPLES processor have buffers connected. A command line parameter to the
Verilog compiler specifies whether these components are used or removed from
the design. One slightly unusual feature of these buffers is that they process data
in a LIFO fashion. The effect of these buffers on performance is an important part
of the system analysis.

The request router 28 employs one of two routing techniques. The technique used
is determined by a command line parameter to the Verilog simulator used to
implement the invention. A comparison of the two routing techniques is important
to understanding the invention. Both routing techniques operate in a similar manner.
The request router 28 decodes the message and determines the destination
processor. It then determines all the valid options for routing the message: the
message could be routed to the local APPLES processor 30 or to one of the
transmitters 25. The message is then routed to one of the valid options.

The first routing technique produces only one valid routing option; if that route is
not blocked, the message is routed in that direction. If it is blocked, the request
router 28 attempts to route a different message. Messages are passed from cell 21
to cell 21 until they reach their destination. Under this routing technique a
message is first passed in either the east or west direction until it is at the
correct east-west location. It is then routed in the north or south direction until
it arrives at its destination. The net result is that the message travels the
minimum distance. This routing strategy results in the traffic between any two
given cells 21 always following the same route through the network. This routing
strategy can be called standard routing.
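Standard routing is a form of dimension-order routing. The Python sketch below chooses the next hop on the torus, resolving east-west before north-south and using wrap-around whenever it is shorter. The orientations (east = +x, south = +y) and the tie rule (exactly half-way round goes in the + direction) are assumptions. All names are hypothetical.

```python
from typing import List, Optional, Tuple

MOVES = {"east": (1, 0), "west": (-1, 0), "south": (0, 1), "north": (0, -1)}


def ring_step(cur: int, dst: int, size: int) -> int:
    """+1, -1 or 0: the shorter way round one ring of the torus (ties go +1, assumed)."""
    forward = (dst - cur) % size  # hops if we keep moving in the + direction
    if forward == 0:
        return 0
    return 1 if forward <= size - forward else -1


def standard_route(cur: Tuple[int, int], dst: Tuple[int, int],
                   width: int, height: int) -> Optional[str]:
    """Next output for a message at cur heading to dst, or None if it is local."""
    dx = ring_step(cur[0], dst[0], width)
    if dx:
        return "east" if dx > 0 else "west"
    dy = ring_step(cur[1], dst[1], height)
    if dy:
        return "south" if dy > 0 else "north"
    return None  # deliver to the local APPLES processor


def standard_path(src: Tuple[int, int], dst: Tuple[int, int],
                  width: int, height: int) -> List[str]:
    """Follow standard_route hop by hop and return the directions taken."""
    path, cur = [], src
    while (d := standard_route(cur, dst, width, height)) is not None:
        path.append(d)
        cur = ((cur[0] + MOVES[d][0]) % width, (cur[1] + MOVES[d][1]) % height)
    return path


print(standard_route((0, 0), (7, 0), 8, 8))  # west (one hop via wrap-around)
print(standard_path((0, 0), (5, 6), 8, 8))
```

The route between a given pair of cells is always the same, which is the property the text attributes to standard routing.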

The second routing technique is more complicated. Under this strategy the request
router 28 determines all of the available directions that the message can take
which will result in it travelling the shortest distance. The various options have
different priorities associated with them, based on the options that were
previously taken. This priority method helps to use the various routes evenly and
therefore efficiently. Some of the options may not be feasible, as they may be in
use by previous messages. An option is chosen based on priority and availability,
and the priority information is then updated. This routing strategy is referred to
as advanced routing.
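Advanced routing can be sketched in Python as minimal adaptive routing. The router collects every direction that stays on a shortest torus path, skips busy outputs, and picks the highest-priority remaining option. A least-recently-used ordering is assumed here as a stand-in for "priority based on the options that were previously taken". All names are hypothetical.

```python
from typing import List, Optional, Set, Tuple

DIRECTIONS = ["east", "west", "south", "north"]


def minimal_directions(cur: Tuple[int, int], dst: Tuple[int, int],
                       width: int, height: int) -> List[str]:
    """Every direction that keeps the message on some shortest torus path."""
    options = []
    for axis, size, plus, minus in ((0, width, "east", "west"),
                                    (1, height, "south", "north")):
        forward = (dst[axis] - cur[axis]) % size
        if forward == 0:
            continue  # already aligned in this dimension
        backward = size - forward
        if forward <= backward:
            options.append(plus)
        if backward <= forward:
            options.append(minus)  # both qualify when exactly half-way round
    return options


class AdvancedRouter:
    """Chooses among minimal directions using an assumed least-recently-used priority."""

    def __init__(self) -> None:
        self.order = list(DIRECTIONS)  # front of the list = highest priority

    def route(self, cur: Tuple[int, int], dst: Tuple[int, int],
              width: int, height: int, busy: Set[str]) -> Optional[str]:
        options = minimal_directions(cur, dst, width, height)
        if not options:
            return "local"  # message is for this cell's APPLES processor
        for d in self.order:
            if d in options and d not in busy:
                self.order.remove(d)
                self.order.append(d)  # just used, so lowest priority next time
                return d
        return None  # all minimal routes blocked: the router drops the attempt


router = AdvancedRouter()
print(router.route((0, 0), (2, 3), 8, 8, busy=set()))  # east
print(router.route((0, 0), (2, 3), 8, 8, busy=set()))  # south
```

Unlike standard routing, successive messages between the same two cells can take different shortest paths.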

For both routing techniques, when all valid paths are blocked and the request
router 28 is unable to route its message, it simply drops the message. This is an
important aspect of the manner in which the request scanner 27 and request router
28 work together. The request scanner 27 takes a message from one of its sources
but does not inform the source that it is attempting to route this message, so the
source maintains the message at its output. If the request router 28 successfully
routes the message, it tells the request scanner 27, which in turn informs the
source. In this way the request router 28 is not committed to routing a particular
message and is therefore always free to attempt to route messages.
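The peek-then-acknowledge handshake between scanner, router and source can be modelled in Python as follows. Each source is a queue. The scanner only peeks at the head. The message is removed only when the router reports success, so a dropped routing attempt loses nothing. The function name and the `try_route` callback are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Optional


def scan_and_route(sources: List[Deque[int]], start: int,
                   try_route: Callable[[int], bool]) -> Optional[int]:
    """One scanner/router attempt.

    The scanner peeks at the first non-empty source from 'start' onwards but does
    not remove the message. Only when the router reports success is the message
    released (popped) from its source. If the router drops it, the source still
    holds the message and it can be offered again on a later attempt.
    Returns the index of the source whose message was delivered, else None.
    """
    n = len(sources)
    for offset in range(n):
        idx = (start + offset) % n
        if sources[idx]:
            msg = sources[idx][0]       # peek only: the source keeps its message
            if try_route(msg):
                sources[idx].popleft()  # success reported back: source releases it
                return idx
            return None                 # dropped by the router; nothing is lost
    return None


srcs = [deque([10]), deque(), deque([30])]
print(scan_and_route(srcs, 0, lambda m: False), list(srcs[0]))  # None [10]
```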

The network interface 42 shares access to the input value register bank 2
between the local processor and the network. The local processor gets priority.
This module decodes the message and updates the appropriate location in the
input value register bank 2.

The network interface 42 is connected between the fan out generator 43 and the
input value register bank 2. It can therefore pass fan out updates from the
processor to the network when appropriate, or simply pass them to the input value
register bank 2. It can also pass fan out updates from the network to the input
value register bank 2. Some changes were required in the fan out generator 43 to
accommodate the network interface 42.

When each processor in the array has processed the fan out list for each of its
active gates and all updates have reached their destination, each processor
can shift its input value register bank 2 into its array 1b and proceed with evaluation
of the next time unit. To achieve this, some synchronisation logic between the
cells 21 is required. The implementation requires each processor to report to its
cell 21 when it has completed sending updates. Each cell 21 also monitors the
network activity and reports back to the array whether there is network activity or
processor activity. The array therefore knows when all processors have finished
updating and when the network is empty. At such a time the array reports back to
the cells 21, and the cells 21 then tell the processors to proceed with the next time
unit in the delay model. Implementing this system required minor changes in the
sequence logic of the APPLES processor.

The network is not used to communicate this synchronisation information. Instead,
dedicated wires are provided. Each cell 21 has a finished input wire and a finished
output wire. The cell 21 holds the finished output wire high when its processor has
finished and no network activity is occurring around the cell 21. The finished input
wire is controlled by the array synchronisation logic, which holds it high when it
detects that all the finished output wires are high at the same time. It would be
possible to use the network to communicate this synchronisation information,
which would reduce the number of Virtex pins required by the design. However,
the synchronisation logic would then be more complex and require more circuitry,
and the synchronisation process would also take longer to execute.
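The finished-wire scheme behaves like a global AND barrier. The Python model below treats each wire as a boolean sampled once per clock. The function names are hypothetical.

```python
from typing import List


def cell_finished_out(processor_done: bool, network_active_nearby: bool) -> bool:
    """A cell drives its 'finished' output high only when its processor has sent
    all of its updates and no network activity is occurring around the cell."""
    return processor_done and not network_active_nearby


def array_finished_in(finished_outputs: List[bool]) -> bool:
    """The array drives every cell's 'finished' input high only when all
    'finished' outputs are high at the same time, i.e. a logical AND."""
    return all(finished_outputs)


# (processor_done, network_active_nearby) for four cells; the third still has traffic
cells = [(True, False), (True, False), (True, True), (True, False)]
print(array_finished_in([cell_finished_out(p, n) for p, n in cells]))  # False
```

The next time unit can start only on a clock where every cell reports an idle processor and a quiet neighbourhood.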


The information pertaining to the circuit description is stored in five memories
within an APPLES processor. Under the basic APPLES Verilog design, these
memories are loaded from data files using the $readmemb and $readmemh system
tasks. For the system to be implemented on a Virtex chip, these memories could be
loaded via a PCI interface.

Under the APPLES array, each processor evaluates part of the circuit to be
simulated. The contents of these five memories need to be split among the
processors in the array. The memory contents also need to be processed to make
them compatible with the array design. Under an implementation using an array of
Virtex chips, this data could be loaded via a PCI bus and distributed using the
array network. The data would be pre-processed for the array, and each processor
would simply need to load the data into its memories. Incorporating a system to
distribute this data into the design is non-trivial. This project is mainly
concerned with the analysis of the array design's ability to simulate circuits; an
analysis of the array's initialisation system is not of paramount importance at
this time. As a result, the initialisation system was not designed.

To initialise the design and so facilitate simulating circuits, a Verilog task was
written to load the memories. The single-processor circuit description files are
loaded into a global memory in the design. Each processor in the array is assigned
a number, calculated by multiplying its y co-ordinate by the array width and adding
its x co-ordinate. Each processor loads a segment of the global Array 1a, Array 1b,
the fan out header table and the fan out size table into its local memory. These
segments are of equal size and are chosen according to the processor number:
processor zero takes the first segment, processor one takes the second segment,
and so on. A segment of the fan out vector table must also be loaded. This
segment is determined by looking at the contents of the local fan out size and fan
out header tables. The first address to be loaded from the global fan out vector
table is the address stored in the first location in the local fan out header table.
The last address to be loaded is calculated by adding the address stored in the
last entry in the local fan out header table to the fan out size stored in the final
entry in the local fan out size table. The addresses within the fan out header table
must be adjusted to point at the new local fan out vector table. This is achieved by
subtracting the address stored in the first location in the local fan out header table
from each address in the same table. Each gate input address stored in the local
fan out vector table must be converted into an array address. An array address
consists of the destination processor's x co-ordinate stored in bits fourteen to
twelve, the destination processor's y co-ordinate stored in bits eleven to nine, and
the gate input's local address on the destination processor stored in bits eight to
zero.
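The splitting steps can be sketched in Python: processor numbering, equal-size segments, slicing and rebasing the fan out vector table, and packing an array address. The sketch assumes that tables divide evenly among processors. It also assumes the fan out size table holds entry counts, so "last header + last size" is treated as an exclusive end address. All names and table contents are hypothetical.

```python
from typing import List, Tuple


def processor_number(x: int, y: int, array_width: int) -> int:
    """Processor number = y * array width + x."""
    return y * array_width + x


def local_segment(global_table: List[int], proc_num: int, num_procs: int) -> List[int]:
    """Equal-size segment of a global table (Array 1a, Array 1b, header or size table)."""
    seg = len(global_table) // num_procs  # assumes the table divides evenly
    return global_table[proc_num * seg:(proc_num + 1) * seg]


def local_fan_out(global_vectors: List[int], local_headers: List[int],
                  local_sizes: List[int]) -> Tuple[List[int], List[int]]:
    """Slice the fan out vector table for one processor and rebase its headers.

    The end address (last header + last size) is treated as exclusive, i.e. the
    size table is assumed to hold entry counts.
    """
    first = local_headers[0]
    last = local_headers[-1] + local_sizes[-1]
    vectors = global_vectors[first:last]
    headers = [h - first for h in local_headers]  # now index the local table
    return vectors, headers


def array_address(dest_x: int, dest_y: int, local_addr: int) -> int:
    """x in bits 14..12, y in bits 11..9, local gate-input address in bits 8..0."""
    if not (0 <= dest_x < 8 and 0 <= dest_y < 8 and 0 <= local_addr < 512):
        raise ValueError("field out of range")
    return (dest_x << 12) | (dest_y << 9) | local_addr


print(processor_number(3, 2, 8))   # 19
print(array_address(1, 0, 0))      # 4096
```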

Using this system, the circuit description is split among the processors. No
consideration is given to deciding which gate is simulated on which processor; the
APPLES circuit description files determine where each gate is simulated. The
layout of these files is determined by the layout of the ISCAS-85 netlist files that
were used to generate the APPLES circuit description files.

Referring to Fig. 10, there is illustrated an alternative layout of the processor, in
which parts similar to those described with reference to Fig. 1 are identified by the
same reference numerals. In this embodiment, the scan registers are identified by
the reference numeral 6a and the general logic sequence is identified by the
reference numeral 40. The processor will also include a circuit splitting logic circuit
41 and a network interface 42. A fan out generator 43 is identified and will include,
for example, the fan out memory 8. The network interface 42 shares access to the
input value register bank 2.

The original APPLES design is written in Verilog, as is the array design. The
Verilog code is written at the behavioural level, the most abstract level available
to a Verilog programmer. As with any Verilog system, it is split into Verilog
modules, each of which is a component of the system. The Verilog modules added
under the APPLES array design are:
• The Top Module
• The Array Module
• The Cell Module
• The Receiver Module
• The Transmitter Module
• The Request Scanner Module
• The Request Router Module
• The Buffer Module
• The Network Interface Module

The Top module is used to test that the system is performing correctly. An
instantiation of the Top module contains an instantiation of the Array module. The
Array contains multiple instantiations of the Cell module. Each Cell contains four
instantiations of both the Transmitter and Receiver modules. A Cell also contains a
Request Scanner, a Request Router, several buffers and an APPLES processor.
The APPLES processor contains instantiations of the standard processor
components along with an instantiation of the Network Interface module. This
structure and the behaviour of these modules were described earlier in this
chapter. Each of these modules is contained within an appropriately named file.


In addition to designing these modules, the array design also required the
following changes:

• The introduction of a Verilog task to split the circuit description information
among the processors in the array. This is located in the APPLES processor
module.

• The incorporation of processor synchronisation logic into the APPLES
processor module, the Cell module and the Array module.

• The integration of the Network Interface module into the APPLES processor.

The APPLES architecture incorporates an alternative timing strategy which obviates
the need for complex deadlock avoidance or recovery procedures and other
mechanisms normally part of an event-driven simulation. The present invention has
an overhead which is considerably less than that of conventional approaches and permits



WO 01/01298 




PCT/IE00/00083



-63- 



gate evaluation to be activated in memory. The reduction in processing overheads is 
manifest in improved speedup performance relative to other techniques. 

A message passing mechanism inherent in the Chandy-Misra algorithms has been
replaced by a parallel scanning mechanism. This mechanism allows the
fan-out/update procedure to be parallelised. As clashes occur, gates are effectively
put into a waiting queue which fills up a fan-out/update pipeline. Consequently, as
the pipeline fills up (with an increasing number of scan registers), performance
increases. The speedup reaches a limit when the rate at which new gates enter the
queue equals the fan-out rate. Nevertheless, the speedup and the number of cycles
per gate processed are considerably better than with conventional approaches. The
system also allows a wide range of delay models.

The bit-pattern gate evaluation mechanism in APPLES facilitates the implementation
of simple and complex delay models as a series of parallel searches. Consequently,
the evaluation process is constant in time, being performed in memory. Effectively,
there is a one-to-one correspondence between gate and processor (the gate word
pairs). This fine-grain parallelism allows maximum parallelism in the gate evaluation
phase. Active gates are automatically identified and their fan-out lists updated
through scanning a hit list. This scanning mechanism is analogous to
communication overhead in typical parallel processing architectures; however, this
scanning is itself amenable to parallelisation. Multiple scan registers reduce the
overhead time and enable the gate processing rate to be limited solely by the
fan-out memory bandwidth. A substantial speedup of logic simulation is attained
with the APPLES architecture, resulting in a gate processing rate of a few machine
cycles per gate.
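The parallel search behind bit-pattern gate evaluation can be illustrated by a Python model of a masked associative compare. Every stored word is checked against a search pattern under a mask, and the result is a hit list. In hardware, all words are compared in the same memory cycle, which is why evaluation time does not grow with the number of gates. The Python loop reproduces only the result. The convention that a mask bit of 1 means "compare this position" is an assumption, and the word contents are hypothetical.

```python
from typing import List


def associative_search(words: List[int], search: int, mask: int) -> List[int]:
    """Masked compare of every stored word against a search pattern.

    Only bit positions where the mask is 1 take part (assumed convention).
    Returns the hit list: 1 for each matching word, 0 otherwise.
    """
    return [1 if ((word ^ search) & mask) == 0 else 0 for word in words]


words = [0b0011_0000, 0b0011_1111, 0b1100_0000]
print(associative_search(words, search=0b0011_0000, mask=0b1111_0000))  # [1, 1, 0]
```

The resulting hit list is what the scan registers then traverse to find the gates whose fan-out lists must be updated.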

In this specification, the terms "comprise", "comprises" and "comprising" are used
interchangeably with the terms "include", "includes" and "including", and are to be
afforded the widest possible interpretation and vice versa.



The invention is not limited to the embodiments hereinbefore described which may be 
varied in both construction and detail within the scope of the claims. 
1. A parallel processing method of logic simulation comprising representing
signals on a line over a time period as a bit sequence, evaluating the output of
any logic gate including an evaluation of any inherent delay by a comparison
between the bit sequences of its inputs to a predetermined series of bit
patterns and in which those logic gates whose outputs have changed over the
time period are identified during the evaluation of the gate outputs as real
gate changes and only those real gate changes are propagated to fan out
gates and in which the control of the method is carried out in an associative
memory mechanism which stores in word form a history of gate input signals
by compiling a hit list register of logic gate state changes and using a multiple
response resolver forming part of the associative memory mechanism which
generates an address for each hit, and then scans and transfers the results
on the hit list to an output register for subsequent use, characterised in that
the hit list is segmented into a plurality of separate smaller hit lists each
connected to a separate scan register and in which each scan register is
operated in parallel to transfer the results to the output register.

2. A method as claimed in claim 1 in which the associative register is divided into
separate smaller associative sub-registers, one type of logic gate being 
allocated to each sub-register, each of which associative sub-registers has 
corresponding sub-registers connected thereto whereby gate evaluations and 
tests are carried out in parallel on each associative sub-register. 

3. A method as claimed in claim 1 or 2 in which each associative sub-register is
used to form a hit list connected to a corresponding separate scan register. 

4. A method as claimed in any of claims 1 to 3 in which, where the number of the
one type of logic gate exceeds a predetermined number, more than one
sub-register is used.

5. A method as claimed in any preceding claim in which the scan registers are
controlled by exception logic using an OR gate whereby the scan is
terminated for each register on the OR gate changing state thus indicating no
further matches.

6. A method as claimed in claim 5 in which the scan is carried out by sequential
counting through the hit list and the steps are performed of: 

checking if the bit is set indicating a hit; 

if a hit, determining the address effected by that hit; 

storing the address; 

clearing the bit in the hit list; 

moving to the next position in the hit list; and 

repeating the above steps until the hit list is cleared. 

7. A method as claimed in any preceding claim, in which each line signal to a
target logic gate is stored as a plurality of bits each representing a delay of 
one time period, the aggregate bits representing the delay between signal 
output to and reception by the target logic gate. 

8. A method as claimed in any preceding claim, in which each delay is stored as
a delay word in an associative memory forming part of the associative 
memory mechanism in which:- 

the length of the delay word is ascertained; and 

if the delay word width exceeds the associative register word width:- 

the number of integer multiples of the register word width contained within the 
delay word is calculated as a gate state; 
the gate state is stored in a further state register;

the remainder from the calculation is stored in the associative register with
those delay words whose widths did not exceed the associative
register word width; and

on the count of the associative register commencing:-

the state register is consulted for the delay word entered in the state register
and the remainder is ignored for this count of the associative register;

at the end of the count of the associative register, the state register is
updated; and

the count continues until the remainder represents the count still required.

9. A method as claimed in any preceding claim in which there is an initialisation
phase in which:

specified signal values are inputted;

unspecified signal values are set to unknown;

test templates are prepared defining the delay model for each logic
gate;

the input circuit is parsed to generate an equivalent circuit consisting
of 2-input logic gates; and

the 2-input logic gates are then configured.



10. A method as claimed in any preceding claim in which a multi-valued logic is
applied and in which n bits are used to represent a signal value at any
instance in time with n being any arbitrarily chosen logic.
11. A method as claimed in claim 10 in which an 8-valued logic is used where 000
represents logic 0, 111 represents logic 1 and 001 to 110 represent arbitrarily
defined other signal states.


12. A method as claimed in claim 10 or 11 in which the sequence of values on a
logic gate is stored as a bit pattern forming a unique word in the associative 
memory mechanism. 

13. A method as claimed in any preceding claim in which there is stored a record
of all values that a logic gate has acquired for the units of delay of the longest 
delay in the circuit. 

14. A parallel processing method of logic simulation comprising representing
signals on a line over a time period as a bit sequence, evaluating the output of
any logic gate including an evaluation of any inherent delay by a comparison
between the bit sequences of its inputs to a predetermined series of bit
patterns and in which those logic gates whose outputs have changed over the
time period are identified during the evaluation of the gate outputs as real
gate changes and only those real gate changes are propagated to fan out
gates and in which the control of the method is carried out in an associative
memory mechanism which stores in word form a history of gate input signals
by compiling a hit list register of logic gate state changes and using a multiple
response resolver forming part of the associative memory mechanism which
generates an address for each hit, and then scans and transfers the results
on the hit list to an output register for subsequent use characterised in that the
associative register is divided into separate smaller associative sub-registers,
one type of logic gate being allocated to each associative sub-register, each
of which associative sub-registers has corresponding sub-registers connected
thereto whereby gate evaluations and tests are carried out in parallel on each
associative sub-register.



15. A method as claimed in claim 1 in which the hit list is segmented into a
plurality of separate smaller hit lists corresponding to each associative
sub-register, each smaller hit list being connected to a separate scan register, and
in which each scan register is operated in parallel to transfer the results to the
output register.

16. A method as claimed in claim 14 or 15 in which, where the number of the one
type of logic exceeds a predetermined number, more than one sub-register is
used.

17. A method as claimed in claim 16 in which the scan registers are controlled by
exception logic using an OR gate whereby the scan is terminated for each
register on the OR gate changing state thus indicating no further matches.

18. A method as claimed in claim 17 in which the scan is carried out by sequential
counting through the hit list and the steps are performed of:

checking if the bit is set indicating a hit;

if a hit, determining the address effected by that hit;

storing the address;

clearing the bit in the hit list;

moving to the next position in the hit list; and

repeating the above steps until the hit list is cleared.






19. A method as claimed in any of claims 14 to 18, in which each line signal to a
target logic gate is stored as a plurality of bits each representing a delay of
one time period, the aggregate bits representing the delay between signal
output to and reception by the target logic gate.



20. A method as claimed in any of claims 14 to 19, in which each delay is stored
as a delay word in an associative memory forming part of the associative
memory mechanism in which:-

the length of the delay word is ascertained; and 

if the delay word width exceeds the associative register word width:- 

the number of integer multiples of the register word width contained 
within the delay word is calculated as a gate state; 

the gate state is stored in a further state register; 

the remainder from the calculation is stored in the associative register 
with those delay words whose widths did not exceed the associative 
register word width; and 

on the count of the associative register commencing:- 

the state register is consulted for the delay word entered in the state 
register and the remainder is ignored for this count of the associative 
register; 

at the end of the count of the associative register, the state register is 
updated; and 

the count continues until the remainder represents the count still 
required. 

21. A method as claimed in any of claims 14 to 20 in which there is an
initialisation phase in which: 

specified signal values are inputted; 

unspecified signal values are set to unknown; 
test templates are prepared defining the delay model for each logic
gate;

the input circuit is parsed to generate an equivalent circuit consisting
of 2-input logic gates; and

the 2-input logic gates are then configured.

22. A method as claimed in any of claims 14 to 21 in which a multi-valued logic is
applied and in which n bits are used to represent a signal value at any
instance in time with n being any arbitrarily chosen logic.

23. A method as claimed in claim 22 in which an 8-valued logic is used where 000
represents logic 0, 111 represents logic 1 and 001 to 110 represent arbitrarily
defined other signal states.

24. A method as claimed in claim 22 or 23 in which the sequence of values on a 
logic gate is stored as a bit pattern forming a unique word in the associative 
memory mechanism. 


25. A method as claimed in any of claims 14 to 24 in which there is stored a 
record of all values that a logic gate has acquired for the units of delay of the 
longest delay in the circuit. 



26. A parallel processing method of logic simulation comprising representing
signals on a line over a time period as a bit sequence, evaluating the output of
any logic gate by a comparison between the bit sequences of its inputs to a
predetermined series of bit patterns and in which those logic gates whose
outputs have changed over the time period are identified during the evaluation
of the gate outputs as real gate changes and only those real gate changes
are propagated to fan out gates and in which the control of the method is
carried out in an associative memory mechanism which stores in word form a
history of gate input signals by compiling a hit list register of logic gate state
changes and using a multiple response resolver forming part of the
associative memory mechanism which generates an address for each hit, and
then scans and transfers the results on the hit list to an output register for
subsequent use characterised in that each line signal to a target logic gate is
stored as a plurality of bits each representing a delay of one time period, the
aggregate bits representing the delay between signal output to and reception
by the target logic gate and in which the inherent delay of each logic gate is
represented in the same manner.

27. A method as claimed in claim 26, in which each delay is stored as a delay
word in an associative memory forming part of the associative memory 
mechanism in which:- 

the length of the delay word is ascertained; and 

if the delay word width exceeds the associative register word width:- 

the number of integer multiples of the register word width contained 
within the delay word is calculated as a gate state; 

the gate state is stored in a further state register; 

the remainder from the calculation is stored in the associative register 
with those delay words whose widths did not exceed the associative 
register word width; and 

on the count of the associative register commencing:-

the state register is consulted for the delay word entered in the state 
register and the remainder is ignored for this count of the associative 
register; 

at the end of the count of the associative register, the state register is 
updated; and 
the count continues until the remainder represents the count still
required. 

28. A method as claimed in claim 26 or 27, in which the hit list is segmented into
a plurality of separate smaller hit lists each connected to a separate scan 
register and in which each scan register is operated in parallel to transfer the 
results to the output register. 

29. A method as claimed in any of claims 26 to 28, in which the scan registers are
controlled by exception logic using an OR gate whereby the scan is 
terminated for each register on the OR gate changing state thus indicating no 
further matches. 

30. A method as claimed in claim 29 in which the scan is carried out by sequential
counting through the hit list and the steps are performed of: 

checking if the bit is set indicating a hit;

if a hit, determining the address effected by that hit; 

storing the address; 

clearing the bit in the hit list; 

moving to the next position in the hit list; and 

repeating the above steps until the hit list is cleared. 

31. A method as claimed in any of claims 26 to 30 in which there is an
initialisation phase in which: 

specified signal values are inputted; 

unspecified signal values are set to unknown; 
test templates are prepared defining the delay model for each logic
gate;

the input circuit is parsed to generate an equivalent circuit consisting
of 2-input logic gates; and

the 2-input logic gates are then configured.

32. A method as claimed in any of claims 26 to 31 in which a multi-valued logic is
applied and in which n bits are used to represent a signal value at any 
instance in time with n being any arbitrarily chosen logic. 

33. A method as claimed in claim 32 in which an 8-valued logic is used where 000 
represents logic 0, 111 represents logic 1 and 001 to 110 represent arbitrarily
defined other signal states.


34. A method as claimed in claim 32 or 33 in which the sequence of values on a
logic gate is stored as a bit pattern forming a unique word in the associative
memory mechanism.

35. A method as claimed in any of claims 26 to 34 in which there is stored a 
record of all values that a logic gate has acquired for the units of delay of the 
longest delay in the circuit. 


36. A parallel processor for logic event simulation (APPLES) comprising:- 

a main processor; 

an associative memory mechanism including a response resolver;



characterised in that the associative memory mechanism comprises:- 
a plurality of separate associative sub-registers each for
the storage in word form of a history of gate input signals 
for a specified type of logic gate; and 

a plurality of separate additional sub-registers associated with each 
associative sub-register whereby gate evaluations and tests can be 
carried out in parallel on each associative sub-register. 

37. A processor as claimed in claim 36, in which the additional sub-registers
comprise an input sub-register, a mask sub-register and a scan sub-register. 

38. A processor as claimed in claim 37, in which the scan sub-registers are
connected to an output register. 
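Claims 8, 20 and 27 describe how a delay longer than the associative register word is handled. The Python below sketches one possible reading of those steps and is an assumption, not the patented circuit:

* The delay, in time periods, is split by integer division into a count of full register widths, held in the state register, and a remainder, held in the associative register.
* Each full count of the associative register ignores the remainder and decrements the state register at its end.
* Once the state reaches zero, the remainder is the count still required.

The names `SplitDelay`, `split_delay` and `count_schedule` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SplitDelay:
    state: int      # integer multiples of the register width (state register)
    remainder: int  # part of the delay kept in the associative register


def split_delay(delay: int, width: int) -> SplitDelay:
    """Split a delay (in time periods) against an associative register `width` bits wide."""
    if delay <= width:
        # Fits in the register word: no state-register entry is needed.
        return SplitDelay(state=0, remainder=delay)
    state, remainder = divmod(delay, width)
    return SplitDelay(state=state, remainder=remainder)


def count_schedule(delay: int, width: int) -> List[int]:
    """Return the length of each count of the associative register for one delay.

    While the state register is non-zero, a full count elapses with the
    remainder ignored, and the state register is decremented at its end.
    The remainder then supplies the final, partial count.
    """
    d = split_delay(delay, width)
    schedule: List[int] = []
    state = d.state
    while state > 0:
        schedule.append(width)  # full count; remainder ignored on this pass
        state -= 1              # state register updated at the end of the count
    if d.remainder:
        schedule.append(d.remainder)  # remainder is the count still required
    return schedule


if __name__ == "__main__":
    print(split_delay(20, 8))     # SplitDelay(state=2, remainder=4)
    print(count_schedule(20, 8))  # [8, 8, 4]
```

Under this reading, the schedule always sums back to the original delay. The fixed-width associative word only ever holds the short remainder, while the state register tracks the number of whole register widths still to elapse.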